date:20071212

Re: [OMPI devel] SCTP noisy failure

2007-12-12 Thread Brad Penoff

On Dec 12, 2007 6:03 PM, Jeff Squyres  wrote:
> On Dec 12, 2007, at 8:58 PM, Brad Penoff wrote:
>
> >> That's not really the issue: I don't *want* SCTP support.  :)
> >>
> >> I have a default RHEL4U4 install and now Open MPI is complaining on a
> >> default mpirun.  Open MPI should work out of the box -- warning free
> >> -- on all supported operating systems.
> >
> > Haha, I caught that part as well (about the exclusivity "fix").  I was
> > just curious why the error is there in the first place because, after
> > all, everyone should want SCTP support, right ;-) ?
>
> ;-)
>
> > I didn't know
> > that any Linux distro had lksctp-tools installed by default, but the
> > module not loaded... learn something new every day though.
>
> Gotta love those screwy software authors!
>
> (I'm sure lots of people say that about us, too :-) )
>
> > So there's two issues (exclusivity not working as expected and then
> > the SCTP failure if you actually wanted SCTP support) and I'm
> > concerned about the one that most of you are not, I'm guessing ;-).
>
>
> I think exclusivity *is* working -- this is before that comes into
> play, IIRC.  The _init function is querying your BTL to see if it
> wants to run.

OK, I've done a commit (r16951) to make it less noisy.  Let me know
how it goes because I can't reproduce this at the moment.

I'm really curious about this release of Red Hat, though.

On my Ubuntu system, if I try to create an SCTP socket and the kernel
module isn't loaded/modprobe'd, the system loads it automatically
based off the entry in the modules alias file which tells the system
where to find the appropriate .ko file.
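To see whether the module is actually loaded on a given box, something like the following could be used. This is only a sketch in Python (not Open MPI code); the helper names are made up for illustration, and it assumes a Linux-style /proc/modules where the first field on each line is the module name. An SCTP stack built directly into the kernel would not show up here.

```python
from pathlib import Path


def module_is_loaded(name: str, proc_modules_text: str) -> bool:
    """Return True if `name` is listed as a module in /proc/modules-style text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field is the module name.
        if fields and fields[0] == name:
            return True
    return False


def sctp_module_loaded(path: str = "/proc/modules") -> bool:
    """Check the running system; report False if the file cannot be read."""
    try:
        return module_is_loaded("sctp", Path(path).read_text())
    except OSError:
        return False
```

Running sctp_module_loaded() before and after a first SCTP socket() call would show whether the alias-based autoload kicked in.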

These mappings seem to be messed up in the Red Hat out-of-the-box
RHEL4U4 release though, if I'm understanding things correctly...  does
anyone know if this is a known problem for this particular distro?
If not, I'll try to get to installing and playing with this release
eventually (working on Solaris support for the SCTP BTL (1-1) and a
few other non-OpenMPI things at the moment though...).

Thanks,
brad

>
> --
>
> Jeff Squyres
> Cisco Systems
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>

Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Jeff Squyres


Tarballs available at:

http://www.open-mpi.org/~jsquyres/unofficial/



On Dec 12, 2007, at 4:08 PM, Jeff Squyres (jsquyres) wrote:

Heh, ok.  I'll make a tarball against your patch later.  It's against
the trunk?


-jms
Sent from my PDA

-----Original Message-----
From:   Gleb Natapov [mailto:gl...@voltaire.com]
Sent:   Wednesday, December 12, 2007 03:54 PM Eastern Standard Time
To: Open MPI Developers
Subject:Re: [OMPI devel] matching code rewrite in OB1

On Wed, Dec 12, 2007 at 03:52:17PM -0500, Jeff Squyres wrote:
> On Dec 12, 2007, at 3:20 PM, Gleb Natapov wrote:
>
> >> How about making a tarball with this patch in it that can be thrown at
> >> everyone's MTT? (we can put the tarball on www.open-mpi.org somewhere)
> > I don't have access to www.open-mpi.org, but I can send you the patch.
> > I can send you a tarball too, but I prefer to not abuse email.
>
> Do you have access to staging.openfabrics.org?  I could download it
> from there and put it on www.open-mpi.org.
>
No. I don't :(

--
Gleb.



--
Jeff Squyres
Cisco Systems

Re: [OMPI devel] SCTP noisy failure

2007-12-12 Thread Jeff Squyres


On Dec 12, 2007, at 8:58 PM, Brad Penoff wrote:


That's not really the issue: I don't *want* SCTP support.  :)

I have a default RHEL4U4 install and now Open MPI is complaining on a
default mpirun.  Open MPI should work out of the box -- warning free
-- on all supported operating systems.


Haha, I caught that part as well (about the exclusivity "fix").  I was
just curious why the error is there in the first place because, after
all, everyone should want SCTP support, right ;-) ?


;-)


I didn't know
that any Linux distro had lksctp-tools installed by default, but the
module not loaded... learn something new every day though.


Gotta love those screwy software authors!

(I'm sure lots of people say that about us, too :-) )


So there's two issues (exclusivity not working as expected and then
the SCTP failure if you actually wanted SCTP support) and I'm
concerned about the one that most of you are not, I'm guessing ;-).



I think exclusivity *is* working -- this is before that comes into  
play, IIRC.  The _init function is querying your BTL to see if it  
wants to run.
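To illustrate the ordering I mean (query first, exclusivity second), here is a toy model in Python. It is not the real selection code, which is more involved; the Component class, the select_components() helper and the exclusivity numbers are all invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Component:
    name: str
    exclusivity: int            # hypothetical ranking value
    query: Callable[[], bool]   # True if the component can run on this system


def select_components(components: List[Component]) -> List[str]:
    # Phase 1: ask every component whether it wants to run.
    # A component that declines here should do so quietly.
    willing = [c for c in components if c.query()]
    if not willing:
        return []
    # Phase 2: exclusivity only ranks components that survived phase 1.
    top = max(c.exclusivity for c in willing)
    return sorted(c.name for c in willing if c.exclusivity == top)
```

In this model a component whose query fails never reaches the exclusivity step, so any warning it prints comes from phase 1.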


--
Jeff Squyres
Cisco Systems

Re: [OMPI devel] SCTP noisy failure

2007-12-12 Thread Brad Penoff

On Dec 12, 2007 5:44 PM, Jeff Squyres  wrote:
> On Dec 12, 2007, at 7:16 PM, Brad Penoff wrote:
>
> > Does your system have sctp in the kernel as a module?  This is the
> > default for most Linux systems so you may have to "modprobe sctp" to
> > get rid of the ESOCKTNOSUPPORT...
>
> That's not really the issue: I don't *want* SCTP support.  :)
>
> I have a default RHEL4U4 install and now Open MPI is complaining on a
> default mpirun.  Open MPI should work out of the box -- warning free
> -- on all supported operating systems.

Haha, I caught that part as well (about the exclusivity "fix").  I was
just curious why the error is there in the first place because, after
all, everyone should want SCTP support, right ;-) ?  I didn't know
that any Linux distro had lksctp-tools installed by default, but the
module not loaded... learn something new every day though.

So there's two issues (exclusivity not working as expected and then
the SCTP failure if you actually wanted SCTP support) and I'm
concerned about the one that most of you are not, I'm guessing ;-).

I'll try to look at the other problem too though...

brad

>
> --
>
> Jeff Squyres
> Cisco Systems
>
>

Re: [OMPI devel] SCTP noisy failure

2007-12-12 Thread Jeff Squyres


On Dec 12, 2007, at 7:16 PM, Brad Penoff wrote:


Does your system have sctp in the kernel as a module?  This is the
default for most Linux systems so you may have to "modprobe sctp" to
get rid of the ESOCKTNOSUPPORT...


That's not really the issue: I don't *want* SCTP support.  :)

I have a default RHEL4U4 install and now Open MPI is complaining on a  
default mpirun.  Open MPI should work out of the box -- warning free  
-- on all supported operating systems.


--
Jeff Squyres
Cisco Systems

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r16909 (f77_hello compiler error)

2007-12-12 Thread George Bosilca

The logic was wrong; I only got half of it. Commit 16950 solves the
problem. Sorry about this.


  Thanks,
george.

On Dec 12, 2007, at 2:44 PM, Jeff Squyres wrote:


Yes -- something changed; I tested all 4 languages extensively before
I committed (but not on mac).  This fails for me on Linux as well;
I'll check into it...

On Dec 12, 2007, at 2:15 PM, Ethan Mallove wrote:


Hello,

Is this change (or r16908) causing the below error in the MTT
trivial test (f77_hello)? The error occurs on Solaris and
Linux.

...
NOTICE: Invoking /ws/ompi-tools/SUNWspro/SOS11/bin/f90 -f77 -ftrap=%none
-I/installs/cGmK/install/include/v9 -xarch=amd64
hello.f -o f77_hello -R/installs/cGmK/install/lib/amd64
-R/opt/mx/lib -L/installs/cGmK/install/lib/amd64 -lmpi_f77
-lmpi -lopen-rte -lopen-pal -lsocket -lnsl -lrt -lm
hello.f:
 MAIN main:
Undefined                       first referenced
 symbol                             in file
intercept_extra_state_t_class   /installs/cGmK/install/lib/amd64/libmpi_f77.so
ld: fatal: Symbol referencing errors. No output written to f77_hello

See also http://www.open-mpi.org/mtt/index.php?do_redir=475.

Didn't look that closely here, just noted the line change
involving "intercept_extra_state".

-Ethan


On Sun, Dec/09/2007 07:19:59PM, bosi...@osl.iu.edu wrote:

Author: bosilca
Date: 2007-12-09 19:19:58 EST (Sun, 09 Dec 2007)
New Revision: 16909
URL: https://svn.open-mpi.org/trac/ompi/changeset/16909

Log:
Avoid a compiler warning about the function being defined but not
used when we compile the profiling layer.

Text files modified:
 trunk/ompi/mpi/f77/register_datarep_f.c | 6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

Modified: trunk/ompi/mpi/f77/register_datarep_f.c
==============================================================================
--- trunk/ompi/mpi/f77/register_datarep_f.c (original)
+++ trunk/ompi/mpi/f77/register_datarep_f.c 2007-12-09 19:19:58 EST (Sun, 09 Dec 2007)
@@ -90,6 +90,9 @@
   MPI_Aint *extra_state_f77;
} intercept_extra_state_t;

+OBJ_CLASS_DECLARATION(intercept_extra_state_t);
+
+#if !OMPI_PROFILE_LAYER
static void
intercept_extra_state_constructor(intercept_extra_state_t *obj)
{
   obj->read_fn_f77 = NULL;
@@ -98,9 +101,6 @@
   obj->extra_state_f77 = NULL;
}

-OBJ_CLASS_DECLARATION(intercept_extra_state_t);
-
-#if !OMPI_PROFILE_LAYER
OBJ_CLASS_INSTANCE(intercept_extra_state_t,
  opal_list_item_t,
  intercept_extra_state_constructor, NULL);
___
svn-full mailing list
svn-f...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/svn-full




--
Jeff Squyres
Cisco Systems





Re: [OMPI devel] SCTP noisy failure

2007-12-12 Thread Brad Penoff

hey Jeff,

Does your system have sctp in the kernel as a module?  This is the
default for most Linux systems so you may have to "modprobe sctp" to
get rid of the ESOCKTNOSUPPORT...
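For anyone decoding that log line by hand, here is a small sketch in Python. The errno table is hard-coded with the Linux values (other platforms number these differently), and decode_socket_failure() is just an illustrative helper, not something that exists in Open MPI.

```python
import re

# Linux errno numbers for the "not supported" family of socket() failures.
LINUX_ERRNO_NAMES = {
    93: "EPROTONOSUPPORT",
    94: "ESOCKTNOSUPPORT",
    97: "EAFNOSUPPORT",
}

LOG_PATTERN = re.compile(r"socket\(\) failed with errno=(\d+)")


def decode_socket_failure(log_line: str):
    """Return the symbolic errno name from a BTL warning line, or None if absent."""
    match = LOG_PATTERN.search(log_line)
    if match is None:
        return None
    code = int(match.group(1))
    return LINUX_ERRNO_NAMES.get(code, f"errno {code}")
```

Feeding it one of the warning lines from the original report gives back ESOCKTNOSUPPORT for errno=94.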

brad

On Dec 12, 2007 3:57 PM, Jeff Squyres  wrote:
> After the exclusivity change today, I notice that I am getting
> warnings for *every* mpirun from the SCTP BTL on RHEL4:
>
> [15:52] svbu-mpi:~/mpi % mpirun -np 2 hello
> [svbu-mpi.cisco.com][1,0][btl_sctp_component.c:615:mca_btl_sctp_component_create_listen] socket() failed with errno=94
> [svbu-mpi.cisco.com][1,1][btl_sctp_component.c:615:mca_btl_sctp_component_create_listen] socket() failed with errno=94
> Hello, world!  I am 0 of 2 (svbu-mpi.cisco.com)
> Hello, world!  I am 1 of 2 (svbu-mpi.cisco.com)
> [15:52] svbu-mpi:~/mpi %
>
> Can these be turned off?  I have a default RHEL4 system -- I haven't
> done anything special to enable / disable SCTP.  Is there a less noisy
> way to tell that SCTP is not enabled on a system?
>
> --
> Jeff Squyres
> Cisco Systems
>
>

[OMPI devel] SCTP noisy failure

2007-12-12 Thread Jeff Squyres

After the exclusivity change today, I notice that I am getting  
warnings for *every* mpirun from the SCTP BTL on RHEL4:


[15:52] svbu-mpi:~/mpi % mpirun -np 2 hello
[svbu-mpi.cisco.com][1,0][btl_sctp_component.c:615:mca_btl_sctp_component_create_listen] socket() failed with errno=94
[svbu-mpi.cisco.com][1,1][btl_sctp_component.c:615:mca_btl_sctp_component_create_listen] socket() failed with errno=94

Hello, world!  I am 0 of 2 (svbu-mpi.cisco.com)
Hello, world!  I am 1 of 2 (svbu-mpi.cisco.com)
[15:52] svbu-mpi:~/mpi %

Can these be turned off?  I have a default RHEL4 system -- I haven't  
done anything special to enable / disable SCTP.  Is there a less noisy  
way to tell that SCTP is not enabled on a system?
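One possible shape for a quiet check, sketched in Python rather than in the BTL's C: try to open an SCTP socket and treat the "not supported" errno family as a plain "no" instead of a warning. The function names are invented here, and the fallback value 132 is the IANA protocol number for SCTP, used only if the socket module lacks IPPROTO_SCTP.

```python
import errno
import socket

# Errors that mean "SCTP is not available here", not "something broke".
_UNSUPPORTED = {
    errno.EPROTONOSUPPORT,
    errno.ESOCKTNOSUPPORT,
    errno.EAFNOSUPPORT,
}


def is_unsupported_error(err_no: int) -> bool:
    """True for errno values that simply indicate missing protocol support."""
    return err_no in _UNSUPPORTED


def sctp_usable() -> bool:
    """Probe for SCTP by opening and closing a one-to-one style socket."""
    proto = getattr(socket, "IPPROTO_SCTP", 132)
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM, proto)
    except OSError as exc:
        if is_unsupported_error(exc.errno):
            return False  # stay quiet: the protocol is just not there
        raise             # anything else is worth reporting
    sock.close()
    return True
```

Note that on systems that autoload the module, the probe itself may cause sctp to be loaded.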


--
Jeff Squyres
Cisco Systems

Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Jeff Squyres

Was Rich referring to ensuring that the test codes checked that their  
payloads were correct (and not re-assembled in the wrong order)?



On Dec 12, 2007, at 4:10 PM, Brian W. Barrett wrote:


On Wed, 12 Dec 2007, Gleb Natapov wrote:


On Wed, Dec 12, 2007 at 03:46:10PM -0500, Richard Graham wrote:
This is better than nothing, but really not very helpful for looking at the
specific issues that can arise with this, unless these systems have several
parallel networks, with tests that will generate a lot of parallel network
traffic, and be able to self check for out-of-order received - i.e. this
needs to be encoded into the payload for verification purposes.  There are
some out-of-order scenarios that need to be generated and checked.  I think
that George may have a system that will be good for this sort of testing.



I am running various tests with multiple networks right now. I use
several IB BTLs and TCP BTL simultaneously. I see many reordered
messages and all tests were OK till now, but they don't encode
message sequence in a payload as far as I know. I'll change one of
them to do so.


Other than Rich's comment that we need sequence numbers, why add them?  We
haven't had them for non-matching packets for the last 3 years in Open MPI
(ie, forever), and I can't see why we would need them.  Yes, we need
sequence numbers for match headers to make sure MPI ordering is correct.
But for the rest of the payload, there's no need with OMPI's datatype
engine.  It's just more payload for no gain.

Brian



--
Jeff Squyres
Cisco Systems

Re: [OMPI devel] New BTL parameter

2007-12-12 Thread Paul H. Hargrove


Gleb Natapov wrote:

On Wed, Dec 12, 2007 at 02:03:02PM -0500, Jeff Squyres wrote:
  

On Dec 9, 2007, at 10:34 AM, Gleb Natapov wrote:



Currently the BTL has a parameter, btl_min_send_size, that is no longer used.
I want to change it to be btl_rndv_eager_limit. This new parameter will
determine the size of the first fragment of the rendezvous protocol. Now we use
btl_eager_limit to set its size. btl_rndv_eager_limit will have to be
smaller than or equal to btl_eager_limit. By default it will be equal to
btl_eager_limit so no behavior change will be observed if the default is
used.
  
Can you describe why it would be better to have the value less than  
the eager limit?




It is just one more knob to tune OB1 algorithm. I sometimes don't want
to send any data by copy in/out at all. This is not possible right now.
With this new param I will be able to control this.
  


From my experience tuning RDMA-rendezvous for the GASNet communications 
library, I know that it was beneficial to piggyback some portion of the 
payload on the rendezvous request.  However, the best [insert your 
favorite performance metric here] was not always achieved by 
piggybacking the maximum that could be buffered at the receiver 
(equivalent of btl_eager_limit).  If I understand correctly, Gleb's 
btl_rndv_eager_limit parameter would allow tuning for this behavior in OMPI.


An artificial/simplified example would be if the eager limit is 32K and 
you have a 64K xfer.  Is it better to send 32K copy in/out plus 32K by 
RDMA, or to send 8K copy in/out plus 56K by RDMA?  If the memcpy() 
overhead for 32K of eager payload exceeds what can be overlapped with 
the rendezvous setup then the second may be the better choice (higher 
bandwidth, lower latency, and lower CPU overheads on both sender and 
receiver).
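The arithmetic in that example can be written down as a short Python sketch. It assumes, as in the example, that whatever is not sent in the first rendezvous fragment moves by RDMA; split_rendezvous() is a made-up helper, and only the two parameter names come from Gleb's proposal.

```python
def split_rendezvous(msg_size, eager_limit, rndv_eager_limit=None):
    """Return (bytes sent by copy in/out, bytes left for RDMA) for one message."""
    if rndv_eager_limit is None:
        # Proposed default: same as the eager limit, so behavior is unchanged.
        rndv_eager_limit = eager_limit
    if not 0 <= rndv_eager_limit <= eager_limit:
        raise ValueError("rndv_eager_limit must be between 0 and eager_limit")
    if msg_size <= eager_limit:
        # Small message: goes out eagerly, no rendezvous at all.
        return msg_size, 0
    first_fragment = min(rndv_eager_limit, msg_size)
    return first_fragment, msg_size - first_fragment


KB = 1024
print(split_rendezvous(64 * KB, 32 * KB))          # 32K copy in/out + 32K RDMA
print(split_rendezvous(64 * KB, 32 * KB, 8 * KB))  # 8K copy in/out + 56K RDMA
```

Setting the new limit to 0 in this model sends nothing by copy in/out, which is the case Gleb says is not possible today.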


-Paul

--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Brian W. Barrett


On Wed, 12 Dec 2007, Gleb Natapov wrote:


On Wed, Dec 12, 2007 at 03:46:10PM -0500, Richard Graham wrote:

This is better than nothing, but really not very helpful for looking at the
specific issues that can arise with this, unless these systems have several
parallel networks, with tests that will generate a lot of parallel network
traffic, and be able to self check for out-of-order received - i.e. this
needs to be encoded into the payload for verification purposes.  There are
some out-of-order scenarios that need to be generated and checked.  I think
that George may have a system that will be good for this sort of testing.


I am running various tests with multiple networks right now. I use
several IB BTLs and TCP BTL simultaneously. I see many reordered
messages and all tests were OK till now, but they don't encode
message sequence in a payload as far as I know. I'll change one of
them to do so.


Other than Rich's comment that we need sequence numbers, why add them?  We 
haven't had them for non-matching packets for the last 3 years in Open MPI 
(ie, forever), and I can't see why we would need them.  Yes, we need 
sequence numbers for match headers to make sure MPI ordering is correct. 
But for the rest of the payload, there's no need with OMPI's datatype 
engine.  It's just more payload for no gain.
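For reference, the match-header case looks roughly like this in a Python sketch. It is not the OB1 code: MatchOrderer is an invented name, there is one counter for a single peer, and wraparound of a fixed-width sequence field is ignored.

```python
class MatchOrderer:
    """Deliver match fragments in sequence order, holding back early arrivals."""

    def __init__(self):
        self.expected = 0   # next sequence number allowed to match
        self.pending = {}   # seq -> fragment that arrived ahead of its turn

    def receive(self, seq, frag):
        """Record one arrival and return the fragments that can now be matched."""
        self.pending[seq] = frag
        ready = []
        while self.expected in self.pending:
            ready.append(self.pending.pop(self.expected))
            self.expected += 1
        return ready
```

Only the match fragments pass through this; the rest of the payload carries no sequence number in this model.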


Brian

Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Jeff Squyres (jsquyres)

Heh, ok.  I'll make a tarball against your patch later.  It's against the trunk?

-jms
Sent from my PDA

-----Original Message-----
From:   Gleb Natapov [mailto:gl...@voltaire.com]
Sent:   Wednesday, December 12, 2007 03:54 PM Eastern Standard Time
To: Open MPI Developers
Subject:Re: [OMPI devel] matching code rewrite in OB1

On Wed, Dec 12, 2007 at 03:52:17PM -0500, Jeff Squyres wrote:
> On Dec 12, 2007, at 3:20 PM, Gleb Natapov wrote:
> 
> >> How about making a tarball with this patch in it that can be thrown  
> >> at
> >> everyone's MTT? (we can put the tarball on www.open-mpi.org  
> >> somewhere)
> > I don't have access to www.open-mpi.org, but I can send you the patch.
> > I can send you a tarball too, but I prefer to not abuse email.
> 
> Do you have access to staging.openfabrics.org?  I could download it  
> from there and put it on www.open-mpi.org.
> 
No. I don't :(

--
Gleb.

Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Gleb Natapov

On Wed, Dec 12, 2007 at 03:52:17PM -0500, Jeff Squyres wrote:
> On Dec 12, 2007, at 3:20 PM, Gleb Natapov wrote:
> 
> >> How about making a tarball with this patch in it that can be thrown  
> >> at
> >> everyone's MTT? (we can put the tarball on www.open-mpi.org  
> >> somewhere)
> > I don't have access to www.open-mpi.org, but I can send you the patch.
> > I can send you a tarball too, but I prefer to not abuse email.
> 
> Do you have access to staging.openfabrics.org?  I could download it  
> from there and put it on www.open-mpi.org.
> 
No. I don't :(

--
Gleb.

Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Jeff Squyres


On Dec 12, 2007, at 3:20 PM, Gleb Natapov wrote:

How about making a tarball with this patch in it that can be thrown  
at
everyone's MTT? (we can put the tarball on www.open-mpi.org  
somewhere)

I don't have access to www.open-mpi.org, but I can send you the patch.
I can send you a tarball too, but I prefer to not abuse email.


Do you have access to staging.openfabrics.org?  I could download it  
from there and put it on www.open-mpi.org.


--
Jeff Squyres
Cisco Systems

Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Gleb Natapov

On Wed, Dec 12, 2007 at 03:46:10PM -0500, Richard Graham wrote:
> This is better than nothing, but really not very helpful for looking at the
> specific issues that can arise with this, unless these systems have several
> parallel networks, with tests that will generate a lot of parallel network
> traffic, and be able to self check for out-of-order received - i.e. this
> needs to be encoded into the payload for verification purposes.  There are
> some out-of-order scenarios that need to be generated and checked.  I think
> that George may have a system that will be good for this sort of testing.
> 
I am running various tests with multiple networks right now. I use
several IB BTLs and TCP BTL simultaneously. I see many reordered
messages and all tests were OK till now, but they don't encode
message sequence in a payload as far as I know. I'll change one of
them to do so.
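The kind of self-checking payload I have in mind could look like the sketch below (Python for brevity, the real tests are MPI programs). The header layout, the fill pattern and both function names are assumptions made up for this example.

```python
import struct

# Hypothetical per-fragment header: (message sequence, fragment index).
HEADER = struct.Struct("!II")


def _pattern(msg_seq, frag_idx, length):
    # Body bytes derived from the header, so misplaced data is detectable.
    return bytes((msg_seq + frag_idx + i) % 256 for i in range(length))


def make_fragment(msg_seq, frag_idx, body_len):
    """Build one fragment that carries its own position in the stream."""
    return HEADER.pack(msg_seq, frag_idx) + _pattern(msg_seq, frag_idx, body_len)


def check_message(fragments, expected_seq):
    """Return True if the reassembled fragments are in order and intact."""
    for position, frag in enumerate(fragments):
        msg_seq, frag_idx = HEADER.unpack_from(frag)
        if msg_seq != expected_seq or frag_idx != position:
            return False
        body = frag[HEADER.size:]
        if body != _pattern(msg_seq, frag_idx, len(body)):
            return False
    return True
```

The receiver runs check_message() on what it got, so a reordering that slips through matching shows up as a test failure rather than silent corruption.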

> Rich
> 
> 
> On 12/12/07 3:20 PM, "Gleb Natapov"  wrote:
> 
> > On Wed, Dec 12, 2007 at 11:57:11AM -0500, Jeff Squyres wrote:
> >> Gleb --
> >> 
> >> How about making a tarball with this patch in it that can be thrown at
> >> everyone's MTT? (we can put the tarball on www.open-mpi.org somewhere)
> > I don't have access to www.open-mpi.org, but I can send you the patch.
> > I can send you a tarball too, but I prefer to not abuse email.
> > 
> >> 
> >> 
> >> On Dec 11, 2007, at 4:14 PM, Richard Graham wrote:
> >> 
> >>> I will re-iterate my concern.  The code that is there now is mostly nine
> >>> years old (with some mods made when it was brought over to Open MPI).  It
> >>> took about 2 months of testing on systems with 5-13 way network parallelism
> >>> to track down all KNOWN race conditions.  This code is at the center of MPI
> >>> correctness, so I am VERY concerned about changing it w/o some very strong
> >>> reasons.  Not opposed, just very cautious.
> >>> 
> >>> Rich
> >>> 
> >>> 
> >>> On 12/11/07 11:47 AM, "Gleb Natapov"  wrote:
> >>> 
>  On Tue, Dec 11, 2007 at 08:36:42AM -0800, Andrew Friedley wrote:
> > Possibly, though I have results from a benchmark I've written
> > indicating
> > the reordering happens at the sender.  I believe I found it was
> > due to
> > the QP striping trick I use to get more bandwidth -- if you back
> > down to
> > one QP (there's a define in the code you can change), the reordering
> > rate drops.
>  Ah, OK. My assumption was just from looking into code, so I may be
>  wrong.
>  
> > 
> > Also I do not make any recursive calls to progress -- at least not
> > directly in the BTL; I can't speak for the upper layers.  The
> > reason I
> > do many completions at once is that it is a big help in turning
> > around
> > receive buffers, making it harder to run out of buffers and drop
> > frags.
> >  I want to say there was some performance benefit as well but I
> > can't
> > say for sure.
>  Currently upper layers of Open MPI may call BTL progress function
>  recursively. I hope this will change some day.
>  
> > 
> > Andrew
> > 
> > Gleb Natapov wrote:
> >> On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
> >>> Try UD, frags are reordered at a very high rate so should be a
> >>> good test.
> >> Good idea, I'll try this. BTW I think the reason for such a high rate of
> >> reordering in UD is that it polls for MCA_BTL_UD_NUM_WC completions
> >> (500) and processes them one by one, and if the progress function is called
> >> recursively, the next 500 completions will be reordered versus the previous
> >> completions (reordering happens on the receiver, not the sender).
> >> 
> >>> Andrew
> >>> 
> >>> Richard Graham wrote:
>  Gleb,
>   I would suggest that before this is checked in this be tested
>  on a
>  system
>  that has N-way network parallelism, where N is as large as you
>  can find.
>  This is a key bit of code for MPI correctness, and out-of-order
>  operations
>  will break it, so you want to maximize the chance for such
>  operations.
>  
>  Rich
>  
>  
>  On 12/11/07 10:54 AM, "Gleb Natapov"  wrote:
>  
> > Hi,
> > 
> >   I did a rewrite of matching code in OB1. I made it much
> > simpler and 2
> > times smaller (which is good, less code - less bugs). I also
> > got rid
> > of huge macros - very helpful if you need to debug something.
> > There
> > is no performance degradation, actually I even see very small
> > performance
> > improvement. I ran MTT with this patch and the result is the
> > same as on
> > trunk. I would like to commit this to the trunk. The patch is
> > attached
> > for everybody to try.
> > 
> > --
> > Gleb

Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Richard Graham

This is better than nothing, but really not very helpful for looking at the
specific issues that can arise with this, unless these systems have several
parallel networks, with tests that will generate a lot of parallel network
traffic, and be able to self check for out-of-order received - i.e. this
needs to be encoded into the payload for verification purposes.  There are
some out-of-order scenarios that need to be generated and checked.  I think
that George may have a system that will be good for this sort of testing.

Rich


On 12/12/07 3:20 PM, "Gleb Natapov"  wrote:

> On Wed, Dec 12, 2007 at 11:57:11AM -0500, Jeff Squyres wrote:
>> Gleb --
>> 
>> How about making a tarball with this patch in it that can be thrown at
>> everyone's MTT? (we can put the tarball on www.open-mpi.org somewhere)
> I don't have access to www.open-mpi.org, but I can send you the patch.
> I can send you a tarball too, but I prefer to not abuse email.
> 
>> 
>> 
>> On Dec 11, 2007, at 4:14 PM, Richard Graham wrote:
>> 
>>> I will re-iterate my concern.  The code that is there now is mostly nine
>>> years old (with some mods made when it was brought over to Open MPI).  It
>>> took about 2 months of testing on systems with 5-13 way network parallelism
>>> to track down all KNOWN race conditions.  This code is at the center of MPI
>>> correctness, so I am VERY concerned about changing it w/o some very strong
>>> reasons.  Not opposed, just very cautious.
>>> 
>>> Rich
>>> 
>>> 
>>> On 12/11/07 11:47 AM, "Gleb Natapov"  wrote:
>>> 
 On Tue, Dec 11, 2007 at 08:36:42AM -0800, Andrew Friedley wrote:
> Possibly, though I have results from a benchmark I've written
> indicating
> the reordering happens at the sender.  I believe I found it was
> due to
> the QP striping trick I use to get more bandwidth -- if you back
> down to
> one QP (there's a define in the code you can change), the reordering
> rate drops.
 Ah, OK. My assumption was just from looking into code, so I may be
 wrong.
 
> 
> Also I do not make any recursive calls to progress -- at least not
> directly in the BTL; I can't speak for the upper layers.  The
> reason I
> do many completions at once is that it is a big help in turning
> around
> receive buffers, making it harder to run out of buffers and drop
> frags.
>  I want to say there was some performance benefit as well but I
> can't
> say for sure.
 Currently upper layers of Open MPI may call BTL progress function
 recursively. I hope this will change some day.
 
> 
> Andrew
> 
> Gleb Natapov wrote:
>> On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
>>> Try UD, frags are reordered at a very high rate so should be a
>>> good test.
>> Good idea, I'll try this. BTW I think the reason for such a high rate of
>> reordering in UD is that it polls for MCA_BTL_UD_NUM_WC completions
>> (500) and processes them one by one, and if the progress function is called
>> recursively, the next 500 completions will be reordered versus the previous
>> completions (reordering happens on the receiver, not the sender).
>> 
>>> Andrew
>>> 
>>> Richard Graham wrote:
 Gleb,
  I would suggest that before this is checked in this be tested
 on a
 system
 that has N-way network parallelism, where N is as large as you
 can find.
 This is a key bit of code for MPI correctness, and out-of-order
 operations
 will break it, so you want to maximize the chance for such
 operations.
 
 Rich
 
 
 On 12/11/07 10:54 AM, "Gleb Natapov"  wrote:
 
> Hi,
> 
>   I did a rewrite of matching code in OB1. I made it much
> simpler and 2
> times smaller (which is good, less code - less bugs). I also
> got rid
> of huge macros - very helpful if you need to debug something.
> There
> is no performance degradation, actually I even see very small
> performance
> improvement. I ran MTT with this patch and the result is the
> same as on
> trunk. I would like to commit this to the trunk. The patch is
> attached
> for everybody to try.
> 
> --
> Gleb.
>> 
>> --
>> Gleb.
>> __

Re: [OMPI devel] [PATCH] openib: clean-up connect to allow for new cm's

2007-12-12 Thread Jon Mason

On Wed, Dec 12, 2007 at 01:35:33PM -0500, Jeff Squyres wrote:
> I agree with Gleb's idea.  More below.
> 
> On Dec 12, 2007, at 12:24 PM, Jon Mason wrote:
> 
> > Ok, glad I got this conversation started :)
> >
> > So, we need a slight redesign to determine the cm method (unless forced
> > via commandline arg).  This can be determined by calling all the
> > individual open routines, and having them return a priority based on
> > their ability to function.  For example, the xoob open function will
> > check the mca_btl_openib_component.num_xrc_qps for a non-zero value and
> > return the priority based on that.
> >
> > Of course, if forced then it will only call that specific open function
> > and throw any relevant errors as necessary.
> 
> 
> Close, but I'd do it slightly differently:
> 
> - open() is *only* used for creating MCA params.  It's a bad name, but  
> it's unfortunately the precedent throughout the rest of the OMPI code  
> base.  :-\ (it has roots in the ompi_info command -- ompi_info has to  
> be able to get a full list of all MCA params regardless of what  
> hardware is available on the current system)
> 
> - during the openib component startup, we should add a query()  
> function that does what you describe.  I.e., we query() each endpoint  
> and it either returns a valid priority or "I don't want to be used  
> with this endpoint."
> 
> - there should be a priority MCA param for every CPC.  Perhaps the CPC  
> base can handle this...?  I'm not sure; it may need to be down in each  
> CPC.
> 
> - the list of CPCs that want to run with each endpoint are ordered by  
> priority (ties will be arbitrarily, but deterministically, broken --  
> alphabetical?) and sent around in the modex.
> 
> - when a new connection comes up, the intersection of the CPC lists  
> for the near and far endpoints is computed and the highest priority  
> CPC is used to make the connection.  Since everyone has the same data,  
> both sides will make the same decision.
> 
> - CPC init may have to change a bit -- more than one CPC may be used  
> for a given endpoint because both the local module and the remote  
> module are involved in making the decision of which CPC is used.
> 
> After this first cut is done, we should probably also add  
> btl_openib_cpc_include and btl_openib_cpc_exclude as I described in a  
> prior mail (just like *_if_include and *_if_exclude in several BTLs)  
> to include/exclude sets of CPCs at run-time.
> 
> > If this sounds sane, then let me know and I'll start coding it up.
> 
> 
> This has actually been on my to-do list for too long; if you have the  
> cycles to do this now, it would be great...

Since I need to have it done before I can do my rdma_cm bits, I'll add
this to my queue and get started immediately.
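As a starting point, here is how I read the selection step, sketched in Python before writing any C. The thread does not say how to combine the two sides' priorities, so summing them (which is symmetric, so both ends agree) is my assumption; choose_cpc() and the priority numbers are invented, and the alphabetical tie-break is the one Jeff floated.

```python
def choose_cpc(local, remote):
    """Pick one CPC from two {name: priority} maps exchanged in the modex.

    Returns None when the two endpoints have no CPC in common.
    """
    common = set(local) & set(remote)
    if not common:
        return None
    # Highest combined priority wins; ties are broken alphabetically so that
    # both sides, holding the same data, reach the same answer.
    return min(common, key=lambda name: (-(local[name] + remote[name]), name))


near = {"oob": 50, "xoob": 60, "rdma_cm": 40}   # hypothetical priorities
far = {"oob": 50, "rdma_cm": 70}
print(choose_cpc(near, far))   # same result as choose_cpc(far, near)
```

A CPC whose query() declines an endpoint would simply be left out of that endpoint's map.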

> 
> I'll make you a bargain: if you do the stuff above, I'll add in the  
> configure/build mojo for selectively compiling the XOOB CPC or not  
> (depending on whether the underlying system has XRC library support or  
> not).  Cool?
> 
> Let's go off on a /tmp-public branch for this so we don't hose the  
> trunk...  I just made /tmp-public/openib-cpc.
> 
> -- 
> Jeff Squyres
> Cisco Systems

Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Gleb Natapov

On Wed, Dec 12, 2007 at 11:57:11AM -0500, Jeff Squyres wrote:
> Gleb --
> 
> How about making a tarball with this patch in it that can be thrown at  
> everyone's MTT? (we can put the tarball on www.open-mpi.org somewhere)
I don't have access to www.open-mpi.org, but I can send you the patch.
I can send you a tarball too, but I prefer to not abuse email.

> 
> 
> On Dec 11, 2007, at 4:14 PM, Richard Graham wrote:
> 
> > I will re-iterate my concern.  The code that is there now is mostly  
> > nine
> > years old (with some mods made when it was brought over to Open  
> > MPI).  It
> > took about 2 months of testing on systems with 5-13 way network  
> > parallelism
> > to track down all KNOWN race conditions.  This code is at the center  
> > of MPI
> > correctness, so I am VERY concerned about changing it w/o some very  
> > strong
> > reasons.  Not opposed, just very cautious.
> >
> > Rich
> >
> >
> > On 12/11/07 11:47 AM, "Gleb Natapov"  wrote:
> >
> >> On Tue, Dec 11, 2007 at 08:36:42AM -0800, Andrew Friedley wrote:
> >>> Possibly, though I have results from a benchmark I've written  
> >>> indicating
> >>> the reordering happens at the sender.  I believe I found it was  
> >>> due to
> >>> the QP striping trick I use to get more bandwidth -- if you back  
> >>> down to
> >>> one QP (there's a define in the code you can change), the reordering
> >>> rate drops.
> >> Ah, OK. My assumption was just from looking into code, so I may be
> >> wrong.
> >>
> >>>
> >>> Also I do not make any recursive calls to progress -- at least not
> >>> directly in the BTL; I can't speak for the upper layers.  The  
> >>> reason I
> >>> do many completions at once is that it is a big help in turning  
> >>> around
> >>> receive buffers, making it harder to run out of buffers and drop  
> >>> frags.
> >>>  I want to say there was some performance benefit as well but I  
> >>> can't
> >>> say for sure.
> >> Currently upper layers of Open MPI may call BTL progress function
> >> recursively. I hope this will change some day.
> >>
> >>>
> >>> Andrew
> >>>
> >>> Gleb Natapov wrote:
> >>>> On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
> >>>>> Try UD, frags are reordered at a very high rate so should be a
> >>>>> good test.
> >>>> Good idea, I'll try this. BTW I think the reason for such a high
> >>>> rate of reordering in UD is that it polls for MCA_BTL_UD_NUM_WC
> >>>> completions (500) and processes them one by one, and if the progress
> >>>> function is called recursively the next 500 completions will be
> >>>> reordered versus previous completions (reordering happens on a
> >>>> receiver, not sender).
> >>>>
> >>>>> Andrew
> >>>>>
> >>>>> Richard Graham wrote:
> >>>>>> Gleb,
> >>>>>>  I would suggest that before this is checked in this be tested
> >>>>>> on a system that has N-way network parallelism, where N is as
> >>>>>> large as you can find.  This is a key bit of code for MPI
> >>>>>> correctness, and out-of-order operations will break it, so you
> >>>>>> want to maximize the chance for such operations.
> >>>>>>
> >>>>>> Rich
> >>>>>>
> >>>>>>
> >>>>>> On 12/11/07 10:54 AM, "Gleb Natapov"  wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>>   I did a rewrite of matching code in OB1. I made it much
> >>>>>>> simpler and 2 times smaller (which is good, less code - less
> >>>>>>> bugs). I also got rid of huge macros - very helpful if you need
> >>>>>>> to debug something. There is no performance degradation,
> >>>>>>> actually I even see very small performance improvement. I ran
> >>>>>>> MTT with this patch and the result is the same as on trunk. I
> >>>>>>> would like to commit this to the trunk. The patch is attached
> >>>>>>> for everybody to try.
> >>>>>>>
> >>>>>>> --
> >>>>>>> Gleb.

Re: [OMPI devel] New BTL parameter

2007-12-12 Thread Gleb Natapov

On Wed, Dec 12, 2007 at 02:03:02PM -0500, Jeff Squyres wrote:
> On Dec 9, 2007, at 10:34 AM, Gleb Natapov wrote:
> 
> >  Currently BTL has parameter btl_min_send_size that is no longer used.
> > I want to change it to be btl_rndv_eager_limit. This new parameter  
> > will
> > determine a size of a first fragment of rendezvous protocol. Now we  
> > use
> > btl_eager_limit to set its size. btl_rndv_eager_limit will have to be
> > smaller or equal to btl_eager_limit. By default it will be equal to
> > btl_eager_limit so no behavior change will be observed if default is
> > used.
> 
> 
> Can you describe why it would be better to have the value less than  
> the eager limit?
> 
It is just one more knob to tune OB1 algorithm. I sometimes don't want
to send any data by copy in/out at all. This is not possible right now.
With this new param I will be able to control this.
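
A rough sketch of how the proposed knob could behave. The struct and
function names below are hypothetical and only mirror the MCA parameter
names from this thread; they are not the real BTL module layout, and
header sizes are ignored for simplicity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-BTL limits; the field names follow the MCA parameter
 * names in the thread, not any real Open MPI structure. */
struct btl_limits {
    size_t eager_limit;       /* btl_eager_limit */
    size_t rndv_eager_limit;  /* proposed btl_rndv_eager_limit */
};

/* Enforce rndv_eager_limit <= eager_limit, as the proposal requires. */
void btl_limits_normalize(struct btl_limits *l)
{
    assert(l != NULL);
    if (l->rndv_eager_limit > l->eager_limit) {
        l->rndv_eager_limit = l->eager_limit;
    }
}

/* Payload bytes carried by the first fragment of a msg_len-byte message. */
size_t first_frag_payload(const struct btl_limits *l, size_t msg_len)
{
    assert(l != NULL);
    if (msg_len <= l->eager_limit) {
        return msg_len;          /* eager: the whole message goes at once */
    }
    /* Rendezvous: only rndv_eager_limit bytes ride along with the
     * request; 0 means no data is sent by copy in/out at all. */
    return l->rndv_eager_limit;
}
```

With the default (rndv_eager_limit equal to eager_limit) the first
rendezvous fragment is as large as it is today; setting it to 0 gives
the "no copy in/out" behavior mentioned above.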

--
Gleb.

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r16909 (f77_hello compiler error)

2007-12-12 Thread Jeff Squyres

Yes -- something changed; I tested all 4 languages extensively before  
I committed (but not on mac).  This fails for me on Linux as well;  
I'll check into it...


On Dec 12, 2007, at 2:15 PM, Ethan Mallove wrote:

> Hello,
>
> Is this change (or r16908) causing the below error in the MTT
> trivial test (f77_hello)? The error occurs on Solaris and
> Linux.
>
>  ...
>  NOTICE: Invoking /ws/ompi-tools/SUNWspro/SOS11/bin/f90 -f77 -ftrap=%none
> -I/installs/cGmK/install/include/v9 -xarch=amd64 hello.f -o f77_hello
> -R/installs/cGmK/install/lib/amd64 -R/opt/mx/lib
> -L/installs/cGmK/install/lib/amd64 -lmpi_f77 -lmpi -lopen-rte
> -lopen-pal -lsocket -lnsl -lrt -lm
>
>  hello.f:
>   MAIN main:
>  Undefined                       first referenced
>   symbol                             in file
>  intercept_extra_state_t_class   /installs/cGmK/install/lib/amd64/libmpi_f77.so
>
>  ld: fatal: Symbol referencing errors. No output written to f77_hello
>
> See also http://www.open-mpi.org/mtt/index.php?do_redir=475.
>
> Didn't look that closely here, just noted the line change
> involving "intercept_extra_state".
>
> -Ethan
>
>
> On Sun, Dec/09/2007 07:19:59PM, bosi...@osl.iu.edu wrote:
>> Author: bosilca
>> Date: 2007-12-09 19:19:58 EST (Sun, 09 Dec 2007)
>> New Revision: 16909
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/16909
>>
>> Log:
>> Avoid a compiler warning about the function being defined but not
>> used when we compile the profiling layer.
>>
>> Text files modified:
>>    trunk/ompi/mpi/f77/register_datarep_f.c |     6 +++---
>>    1 files changed, 3 insertions(+), 3 deletions(-)
>>
>> Modified: trunk/ompi/mpi/f77/register_datarep_f.c
>> ==============================================================================
>> --- trunk/ompi/mpi/f77/register_datarep_f.c (original)
>> +++ trunk/ompi/mpi/f77/register_datarep_f.c 2007-12-09 19:19:58 EST (Sun, 09 Dec 2007)
>> @@ -90,6 +90,9 @@
>>  MPI_Aint *extra_state_f77;
>>  } intercept_extra_state_t;
>>
>> +OBJ_CLASS_DECLARATION(intercept_extra_state_t);
>> +
>> +#if !OMPI_PROFILE_LAYER
>>  static void intercept_extra_state_constructor(intercept_extra_state_t *obj)
>>  {
>>  obj->read_fn_f77 = NULL;
>> @@ -98,9 +101,6 @@
>>  obj->extra_state_f77 = NULL;
>>  }
>>
>> -OBJ_CLASS_DECLARATION(intercept_extra_state_t);
>> -
>> -#if !OMPI_PROFILE_LAYER
>>  OBJ_CLASS_INSTANCE(intercept_extra_state_t,
>>  opal_list_item_t,
>>  intercept_extra_state_constructor, NULL);
>> ___
>> svn-full mailing list
>> svn-f...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/svn-full




--
Jeff Squyres
Cisco Systems

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r16909 (f77_hello compiler error)

2007-12-12 Thread Ethan Mallove

Hello,

Is this change (or r16908) causing the below error in the MTT
trivial test (f77_hello)? The error occurs on Solaris and
Linux.

  ...
  NOTICE: Invoking /ws/ompi-tools/SUNWspro/SOS11/bin/f90 -f77 -ftrap=%none 
-I/installs/cGmK/install/include/v9 -xarch=amd64 hello.f -o f77_hello 
-R/installs/cGmK/install/lib/amd64 -R/opt/mx/lib 
-L/installs/cGmK/install/lib/amd64 -lmpi_f77 -lmpi -lopen-rte 
-lopen-pal -lsocket -lnsl -lrt -lm
  hello.f:
   MAIN main:
  Undefined                       first referenced
   symbol                             in file
  intercept_extra_state_t_class   /installs/cGmK/install/lib/amd64/libmpi_f77.so
  ld: fatal: Symbol referencing errors. No output written to f77_hello

See also http://www.open-mpi.org/mtt/index.php?do_redir=475.

Didn't look that closely here, just noted the line change
involving "intercept_extra_state".

-Ethan


On Sun, Dec/09/2007 07:19:59PM, bosi...@osl.iu.edu wrote:
> Author: bosilca
> Date: 2007-12-09 19:19:58 EST (Sun, 09 Dec 2007)
> New Revision: 16909
> URL: https://svn.open-mpi.org/trac/ompi/changeset/16909
> 
> Log:
> Avoid a compiler warning about the function being defined but not
> used when we compile the profiling layer.
> 
> Text files modified:
>    trunk/ompi/mpi/f77/register_datarep_f.c |     6 +++---
>    1 files changed, 3 insertions(+), 3 deletions(-)
> 
> Modified: trunk/ompi/mpi/f77/register_datarep_f.c
> ==============================================================================
> --- trunk/ompi/mpi/f77/register_datarep_f.c   (original)
> +++ trunk/ompi/mpi/f77/register_datarep_f.c   2007-12-09 19:19:58 EST (Sun, 09 Dec 2007)
> @@ -90,6 +90,9 @@
>  MPI_Aint *extra_state_f77;
>  } intercept_extra_state_t;
>  
> +OBJ_CLASS_DECLARATION(intercept_extra_state_t);
> +
> +#if !OMPI_PROFILE_LAYER
>  static void intercept_extra_state_constructor(intercept_extra_state_t *obj)
>  {
>  obj->read_fn_f77 = NULL;
> @@ -98,9 +101,6 @@
>  obj->extra_state_f77 = NULL;
>  }
>  
> -OBJ_CLASS_DECLARATION(intercept_extra_state_t);
> -
> -#if !OMPI_PROFILE_LAYER
>  OBJ_CLASS_INSTANCE(intercept_extra_state_t,
> opal_list_item_t,
> intercept_extra_state_constructor, NULL);

Re: [OMPI devel] New BTL parameter

2007-12-12 Thread Jeff Squyres


On Dec 9, 2007, at 10:34 AM, Gleb Natapov wrote:


>  Currently BTL has parameter btl_min_send_size that is no longer used.
> I want to change it to be btl_rndv_eager_limit. This new parameter will
> determine a size of a first fragment of rendezvous protocol. Now we use
> btl_eager_limit to set its size. btl_rndv_eager_limit will have to be
> smaller or equal to btl_eager_limit. By default it will be equal to
> btl_eager_limit so no behavior change will be observed if default is
> used.



Can you describe why it would be better to have the value less than  
the eager limit?


--
Jeff Squyres
Cisco Systems

Re: [OMPI devel] [PATCH] openib: clean-up connect to allow for new cm's

2007-12-12 Thread Jeff Squyres


I agree with Gleb's idea.  More below.

On Dec 12, 2007, at 12:24 PM, Jon Mason wrote:


> Ok, glad I got this conversation started :)
>
> So, we need a slight redesign to determine the cm method (unless forced
> via commandline arg).  This can be determined by calling all the
> individual open routines, and having them return a priority based on
> their ability to function.  For example, the xoob open function will
> check the mca_btl_openib_component.num_xrc_qps for a non-zero value and
> return the priority based on that.
>
> Of course, if forced then it will only call that specific open function
> and throw any relevant errors as necessary.



Close, but I'd do it slightly differently:

- open() is *only* used for creating MCA params.  It's a bad name, but  
it's unfortunately the precedent throughout the rest of the OMPI code  
base.  :-\ (it has roots in the ompi_info command -- ompi_info has to  
be able to get a full list of all MCA params regardless of what  
hardware is available on the current system)


- during the openib component startup, we should add a query()  
function that does what you describe.  I.e., we query() each endpoint  
and it either returns a valid priority or "I don't want to be used  
with this endpoint."


- there should be a priority MCA param for every CPC.  Perhaps the CPC  
base can handle this...?  I'm not sure; it may need to be down in each  
CPC.


- the list of CPCs that want to run with each endpoint are ordered by  
priority (ties will be arbitrarily, but deterministically, broken --  
alphabetical?) and sent around in the modex.


- when a new connection comes up, the intersection of the CPC lists  
for the near and far endpoints is computed and the highest priority  
CPC is used to make the connection.  Since everyone has the same data,  
both sides will make the same decision.


- CPC init may have to change a bit -- more than one CPC may be used  
for a given endpoint because both the local module and the remote  
module are involved in making the decision of which CPC is used.


After this first cut is done, we should probably also add  
btl_openib_cpc_include and btl_openib_cpc_exclude as I described in a  
prior mail (just like *_if_include and *_if_exclude in several BTLs)  
to include/exclude sets of CPCs at run-time.
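
To make the intersection step concrete, here is a minimal sketch. The
struct layout, the function name and the way the two sides' priorities
are combined (a plain sum, chosen only because it is symmetric) are all
assumptions for illustration; the thread does not specify them. Ties
are broken alphabetically, as floated above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry of the per-endpoint CPC list exchanged in the modex.
 * Hypothetical layout, for illustration only. */
struct cpc_entry {
    const char *name;   /* e.g. "oob", "xoob", "rdma_cm" */
    int priority;       /* value reported by that CPC's query() */
};

/* Return the name of the CPC both endpoints should use, or NULL if the
 * two lists have no CPC in common. */
const char *cpc_select(const struct cpc_entry *local, size_t nlocal,
                       const struct cpc_entry *remote, size_t nremote)
{
    const char *best = NULL;
    int best_score = 0;

    assert(local != NULL || nlocal == 0);
    assert(remote != NULL || nremote == 0);

    for (size_t i = 0; i < nlocal; ++i) {
        for (size_t j = 0; j < nremote; ++j) {
            if (strcmp(local[i].name, remote[j].name) != 0) {
                continue;               /* not in the intersection */
            }
            /* Symmetric score, so swapping the two lists cannot change
             * the winner. */
            int score = local[i].priority + remote[j].priority;
            if (best == NULL || score > best_score ||
                (score == best_score && strcmp(local[i].name, best) < 0)) {
                best = local[i].name;
                best_score = score;
            }
        }
    }
    return best;
}
```

Because the score and the tie-break depend only on data both peers
hold, calling this with the arguments swapped on the far side picks the
same CPC name.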



> If this sounds sane, then let me know and I'll start coding it up.



This has actually been on my to-do list for too long; if you have the  
cycles to do this now, it would be great...


I'll make you a bargain: if you do the stuff above, I'll add in the  
configure/build mojo for selectively compiling the XOOB CPC or not  
(depending on whether the underlying system has XRC library support or  
not).  Cool?


Let's go off on a /tmp-public branch for this so we don't hose the  
trunk...  I just made /tmp-public/openib-cpc.


--
Jeff Squyres
Cisco Systems

Re: [OMPI devel] [PATCH] openib: clean-up connect to allow for new cm's

2007-12-12 Thread Jon Mason

Ok, glad I got this conversation started :)

So, we need a slight redesign to determine the cm method (unless forced
via commandline arg).  This can be determined by calling all the
individual open routines, and having them return a priority based on
their ability to function.  For example, the xoob open function will
check the mca_btl_openib_component.num_xrc_qps for a non-zero value and
return the priority based on that.

Of course, if forced then it will only call that specific open function
and throw any relevant errors as necessary.

If this sounds sane, then let me know and I'll start coding it up.

Thanks,
Jon

Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Jeff Squyres


Gleb --

How about making a tarball with this patch in it that can be thrown at  
everyone's MTT? (we can put the tarball on www.open-mpi.org somewhere)



On Dec 11, 2007, at 4:14 PM, Richard Graham wrote:

> I will re-iterate my concern.  The code that is there now is mostly nine
> years old (with some mods made when it was brought over to Open MPI).  It
> took about 2 months of testing on systems with 5-13 way network parallelism
> to track down all KNOWN race conditions.  This code is at the center of MPI
> correctness, so I am VERY concerned about changing it w/o some very strong
> reasons.  Not opposed, just very cautious.
>
> Rich
>
>
> On 12/11/07 11:47 AM, "Gleb Natapov"  wrote:
>
>> On Tue, Dec 11, 2007 at 08:36:42AM -0800, Andrew Friedley wrote:
>>> Possibly, though I have results from a benchmark I've written indicating
>>> the reordering happens at the sender.  I believe I found it was due to
>>> the QP striping trick I use to get more bandwidth -- if you back down to
>>> one QP (there's a define in the code you can change), the reordering
>>> rate drops.
>> Ah, OK. My assumption was just from looking into code, so I may be
>> wrong.
>>
>>>
>>> Also I do not make any recursive calls to progress -- at least not
>>> directly in the BTL; I can't speak for the upper layers.  The reason I
>>> do many completions at once is that it is a big help in turning around
>>> receive buffers, making it harder to run out of buffers and drop frags.
>>>  I want to say there was some performance benefit as well but I can't
>>> say for sure.
>> Currently upper layers of Open MPI may call BTL progress function
>> recursively. I hope this will change some day.
>>
>>>
>>> Andrew
>>>
>>> Gleb Natapov wrote:
>>>> On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
>>>>> Try UD, frags are reordered at a very high rate so should be a good test.
>>>> Good idea, I'll try this. BTW I think the reason for such a high rate of
>>>> reordering in UD is that it polls for MCA_BTL_UD_NUM_WC completions
>>>> (500) and processes them one by one, and if the progress function is
>>>> called recursively the next 500 completions will be reordered versus
>>>> previous completions (reordering happens on a receiver, not sender).
>>>>
>>>>> Andrew
>>>>>
>>>>> Richard Graham wrote:
>>>>>> Gleb,
>>>>>>  I would suggest that before this is checked in this be tested on a
>>>>>> system that has N-way network parallelism, where N is as large as you
>>>>>> can find.  This is a key bit of code for MPI correctness, and
>>>>>> out-of-order operations will break it, so you want to maximize the
>>>>>> chance for such operations.
>>>>>>
>>>>>> Rich
>>>>>>
>>>>>>
>>>>>> On 12/11/07 10:54 AM, "Gleb Natapov"  wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>   I did a rewrite of matching code in OB1. I made it much simpler and 2
>>>>>>> times smaller (which is good, less code - less bugs). I also got rid
>>>>>>> of huge macros - very helpful if you need to debug something. There
>>>>>>> is no performance degradation, actually I even see very small
>>>>>>> performance improvement. I ran MTT with this patch and the result is
>>>>>>> the same as on trunk. I would like to commit this to the trunk. The
>>>>>>> patch is attached for everybody to try.
>>>>>>>
>>>>>>> --
>>>>>>> Gleb.



--
Jeff Squyres
Cisco Systems

Re: [OMPI devel] SCTP BTL exclusivity value problem

2007-12-12 Thread Karol Mroz

I just read this thread... many thanks for applying the fix.

Jeff Squyres wrote:
> Done in r16942.
> 
> On Dec 12, 2007, at 10:45 AM, Gleb Natapov wrote:
> 
>> On Wed, Dec 12, 2007 at 10:31:37AM -0500, Jeff Squyres wrote:
>>> I'd be in favor of setting the TCP exclusivity to LOW+100 and setting
>>> SCTP exclusivity to LOW.
>> Fine with me.
>>
>>>
>>> On Dec 12, 2007, at 10:07 AM, Gleb Natapov wrote:
>>>
>>>> On Wed, Dec 12, 2007 at 10:02:07AM -0500, Jeff Squyres wrote:
>>>>> Yes -- this came up in a prior thread.  See what I proposed:
>>>>>
>>>>>    http://www.open-mpi.org/community/lists/devel/2007/12/2698.php
>>>>>
>>>>> (no one replied, so no action was taken)
>>>>>
>>>>> Are you on a system where the SCTP BTL is being built?  What kind of
>>>>> environment is it?
>>>> Red Hat Enterprise Linux AS release 4 (Nahant Update 5)
>>>>
>>>> # rpm -qa | grep sctp
>>>> lksctp-tools-devel-1.0.2-6.4E.1
>>>> lksctp-tools-doc-1.0.2-6.4E.1
>>>> lksctp-tools-1.0.2-6.4E.1
>>>>
>>>>>
>>>>>
>>>>> On Dec 12, 2007, at 9:38 AM, Gleb Natapov wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> SCTP BTL sets its exclusivity value to MCA_BTL_EXCLUSIVITY_LOW - 1
>>>>>> but MCA_BTL_EXCLUSIVITY_LOW is zero so actually it is set to max
>>>>>> exclusivity possible. Can somebody fix this please? Maybe we should
>>>>>> not define MCA_BTL_EXCLUSIVITY_LOW to zero?
>>>>>>
>>>>>> --
>>>>>>  Gleb.


-- 
Karol Mroz
km...@cs.ubc.ca

Re: [OMPI devel] initial SCTP BTL commit comments?

2007-12-12 Thread Andrew Friedley


Jeff Squyres wrote:
> Alternatively, you could do what the ofud BTL does (a currently
> experimental BTL): look for the string "ofud" in the "btl" MCA
> parameter -- i.e., see if the user explicitly asked for the ofud BTL.
> If not found (doing the Right Things with the "^" operator, of
> course), then disable the ofud BTL by returning NULL from the
> component_init() function.
>
> Either seems fine to me; the ofud method seems a little less elegant
> -- was there a reason not to use exclusivity here?  Was it just the
> fact that TCP's exclusivity is already the lowest possible value (0)?


Sorry.. try putting my name in the email or something so I know you're 
asking me.


I think there was but I don't remember right now.  If a low exclusivity 
for the UD BTL means it won't get used with the RC BTL, then that's 
fine.  I don't like that string parsing code anyway.  Suggestions on 
what to set the exclusivity to?
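
For reference, the kind of string check being discussed could look
roughly like the sketch below. It is not the actual ofud code; the
function name is made up, and it only assumes the usual conventions for
the "btl" MCA parameter (a comma-separated list, with a leading "^"
turning the list into an exclusion list).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Return true only if `name` appears as a whole comma-separated token in
 * an inclusive component list such as "self,sm,ofud".  A list that starts
 * with '^' is an exclusion list, so nothing in it counts as requested. */
bool btl_explicitly_requested(const char *btl_param, const char *name)
{
    assert(name != NULL && name[0] != '\0');

    if (btl_param == NULL || btl_param[0] == '\0' || btl_param[0] == '^') {
        return false;
    }

    size_t name_len = strlen(name);
    const char *p = btl_param;
    while (*p != '\0') {
        size_t tok_len = strcspn(p, ",");      /* length of current token */
        if (tok_len == name_len && strncmp(p, name, name_len) == 0) {
            return true;
        }
        p += tok_len;
        if (*p == ',') {
            ++p;                               /* skip the separator */
        }
    }
    return false;
}
```

A component_init() built on this would return NULL whenever the check
is false, which is the behavior a low exclusivity value is meant to
replace.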


Andrew

Re: [OMPI devel] SCTP BTL exclusivity value problem

2007-12-12 Thread Jeff Squyres


Done in r16942.

On Dec 12, 2007, at 10:45 AM, Gleb Natapov wrote:

> On Wed, Dec 12, 2007 at 10:31:37AM -0500, Jeff Squyres wrote:
>> I'd be in favor of setting the TCP exclusivity to LOW+100 and setting
>> SCTP exclusivity to LOW.
> Fine with me.
>
>>
>> On Dec 12, 2007, at 10:07 AM, Gleb Natapov wrote:
>>
>>> On Wed, Dec 12, 2007 at 10:02:07AM -0500, Jeff Squyres wrote:
>>>> Yes -- this came up in a prior thread.  See what I proposed:
>>>>
>>>>    http://www.open-mpi.org/community/lists/devel/2007/12/2698.php
>>>>
>>>> (no one replied, so no action was taken)
>>>>
>>>> Are you on a system where the SCTP BTL is being built?  What kind of
>>>> environment is it?
>>> Red Hat Enterprise Linux AS release 4 (Nahant Update 5)
>>>
>>> # rpm -qa | grep sctp
>>> lksctp-tools-devel-1.0.2-6.4E.1
>>> lksctp-tools-doc-1.0.2-6.4E.1
>>> lksctp-tools-1.0.2-6.4E.1
>>>
>>>>
>>>> On Dec 12, 2007, at 9:38 AM, Gleb Natapov wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> SCTP BTL sets its exclusivity value to MCA_BTL_EXCLUSIVITY_LOW - 1
>>>>> but MCA_BTL_EXCLUSIVITY_LOW is zero so actually it is set to max
>>>>> exclusivity possible. Can somebody fix this please? Maybe we should
>>>>> not define MCA_BTL_EXCLUSIVITY_LOW to zero?
>>>>>
>>>>> --
>>>>> Gleb.



--
Jeff Squyres
Cisco Systems

Re: [OMPI devel] SCTP BTL exclusivity value problem

2007-12-12 Thread Gleb Natapov

On Wed, Dec 12, 2007 at 10:31:37AM -0500, Jeff Squyres wrote:
> I'd be in favor of setting the TCP exclusivity to LOW+100 and setting  
> SCTP exclusivity to LOW.
Fine with me.

> 
> 
> On Dec 12, 2007, at 10:07 AM, Gleb Natapov wrote:
> 
> > On Wed, Dec 12, 2007 at 10:02:07AM -0500, Jeff Squyres wrote:
> >> Yes -- this came up in a prior thread.  See what I proposed:
> >>
> >> http://www.open-mpi.org/community/lists/devel/2007/12/2698.php
> >>
> >> (no one replied, so no action was taken)
> >>
> >> Are you on a system where the SCTP BTL is being built?  What kind of
> >> environment is it?
> > Red Hat Enterprise Linux AS release 4 (Nahant Update 5)
> >
> > # rpm -qa | grep sctp
> > lksctp-tools-devel-1.0.2-6.4E.1
> > lksctp-tools-doc-1.0.2-6.4E.1
> > lksctp-tools-1.0.2-6.4E.1
> >
> >>
> >>
> >>
> >> On Dec 12, 2007, at 9:38 AM, Gleb Natapov wrote:
> >>
> >>> Hi,
> >>>
> >>> SCTP BTL sets its exclusivity value to MCA_BTL_EXCLUSIVITY_LOW - 1
> >>> but MCA_BTL_EXCLUSIVITY_LOW is zero so actually it is set to max
> >>> exclusivity possible. Can somebody fix this please? Maybe we should
> >>> not define MCA_BTL_EXCLUSIVITY_LOW to zero?
> >>>
> >>> --
> >>>   Gleb.

--
Gleb.

Re: [OMPI devel] SCTP BTL exclusivity value problem

2007-12-12 Thread Jeff Squyres

I'd be in favor of setting the TCP exclusivity to LOW+100 and setting  
SCTP exclusivity to LOW.



On Dec 12, 2007, at 10:07 AM, Gleb Natapov wrote:

> On Wed, Dec 12, 2007 at 10:02:07AM -0500, Jeff Squyres wrote:
>> Yes -- this came up in a prior thread.  See what I proposed:
>>
>> http://www.open-mpi.org/community/lists/devel/2007/12/2698.php
>>
>> (no one replied, so no action was taken)
>>
>> Are you on a system where the SCTP BTL is being built?  What kind of
>> environment is it?
> Red Hat Enterprise Linux AS release 4 (Nahant Update 5)
>
> # rpm -qa | grep sctp
> lksctp-tools-devel-1.0.2-6.4E.1
> lksctp-tools-doc-1.0.2-6.4E.1
> lksctp-tools-1.0.2-6.4E.1
>
>>
>> On Dec 12, 2007, at 9:38 AM, Gleb Natapov wrote:
>>
>>> Hi,
>>>
>>> SCTP BTL sets its exclusivity value to MCA_BTL_EXCLUSIVITY_LOW - 1
>>> but MCA_BTL_EXCLUSIVITY_LOW is zero so actually it is set to max
>>> exclusivity possible. Can somebody fix this please? Maybe we should
>>> not define MCA_BTL_EXCLUSIVITY_LOW to zero?
>>>
>>> --
>>> Gleb.



--
Jeff Squyres
Cisco Systems

Re: [OMPI devel] SCTP BTL exclusivity value problem

2007-12-12 Thread Gleb Natapov

On Wed, Dec 12, 2007 at 10:02:07AM -0500, Jeff Squyres wrote:
> Yes -- this came up in a prior thread.  See what I proposed:
> 
>  http://www.open-mpi.org/community/lists/devel/2007/12/2698.php
> 
> (no one replied, so no action was taken)
> 
> Are you on a system where the SCTP BTL is being built?  What kind of  
> environment is it?
Red Hat Enterprise Linux AS release 4 (Nahant Update 5)

# rpm -qa | grep sctp
lksctp-tools-devel-1.0.2-6.4E.1
lksctp-tools-doc-1.0.2-6.4E.1
lksctp-tools-1.0.2-6.4E.1

> 
> 
> 
> On Dec 12, 2007, at 9:38 AM, Gleb Natapov wrote:
> 
> > Hi,
> >
> >  SCTP BTL sets its exclusivity value to MCA_BTL_EXCLUSIVITY_LOW - 1
> > but MCA_BTL_EXCLUSIVITY_LOW is zero so actually it is set to max
> > exclusivity possible. Can somebody fix this please? Maybe we should
> > not define MCA_BTL_EXCLUSIVITY_LOW to zero?
> >
> > --
> > Gleb.

--
Gleb.

Re: [OMPI devel] SCTP BTL exclusivity value problem

2007-12-12 Thread Jeff Squyres


Yes -- this came up in a prior thread.  See what I proposed:

http://www.open-mpi.org/community/lists/devel/2007/12/2698.php

(no one replied, so no action was taken)

Are you on a system where the SCTP BTL is being built?  What kind of  
environment is it?




On Dec 12, 2007, at 9:38 AM, Gleb Natapov wrote:

> Hi,
>
>  SCTP BTL sets its exclusivity value to MCA_BTL_EXCLUSIVITY_LOW - 1
> but MCA_BTL_EXCLUSIVITY_LOW is zero so actually it is set to max
> exclusivity possible. Can somebody fix this please? Maybe we should
> not define MCA_BTL_EXCLUSIVITY_LOW to zero?
>
> --
> Gleb.



--
Jeff Squyres
Cisco Systems

[OMPI devel] SCTP BTL exclusivity value problem

2007-12-12 Thread Gleb Natapov

Hi,

  SCTP BTL sets its exclusivity value to MCA_BTL_EXCLUSIVITY_LOW - 1
but MCA_BTL_EXCLUSIVITY_LOW is zero so actually it is set to max
exclusivity possible. Can somebody fix this please? Maybe we should not
define MCA_BTL_EXCLUSIVITY_LOW to zero?
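
A small stand-alone illustration of the wraparound, assuming the
exclusivity field is an unsigned 32-bit integer. EXCLUSIVITY_LOW and the
function names are stand-ins written for this example, not the real
Open MPI definitions; the TCP/SCTP values mirror the LOW+100 / LOW
ordering proposed in the replies.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the constant discussed in this thread; assumed to be 0. */
#define EXCLUSIVITY_LOW ((uint32_t)0)

/* What "LOW - 1" becomes once it is stored in an unsigned 32-bit field. */
uint32_t exclusivity_low_minus_one(void)
{
    uint32_t exclusivity = EXCLUSIVITY_LOW;
    exclusivity -= 1;                    /* wraps modulo 2^32 */
    assert(exclusivity == UINT32_MAX);   /* i.e. the maximum, not "below LOW" */
    return exclusivity;
}

/* Ordering proposed in the replies: SCTP at LOW, TCP at LOW + 100. */
uint32_t exclusivity_sctp(void) { return EXCLUSIVITY_LOW; }
uint32_t exclusivity_tcp(void)  { return EXCLUSIVITY_LOW + 100; }
```

Raising TCP instead of lowering SCTP keeps every value non-negative, so
the comparison between the two BTLs stays meaningful without changing
the definition of LOW.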

--
Gleb.

Re: [OMPI devel] [PATCH] openib: clean-up connect to allow for new cm's

2007-12-12 Thread Gleb Natapov

On Wed, Dec 12, 2007 at 04:08:31PM +0200, Pavel Shamis (Pasha) wrote:
> Gleb Natapov wrote:
>> On Wed, Dec 12, 2007 at 03:37:26PM +0200, Pavel Shamis (Pasha) wrote:
>>
>>> Gleb Natapov wrote:
>>>
>>>> On Tue, Dec 11, 2007 at 08:16:07PM -0500, Jeff Squyres wrote:
>>>>
>>>>> Isn't there a better way somehow?  Perhaps we should have "select"
>>>>> call *all* the functions and accept back a priority.  The one with the
>>>>> highest priority then wins.  This is quite similar to much of the
>>>>> other selection logic in OMPI.
>>>>>
>>>>> Sidenote: Keep in mind that there are some changes coming to select
>>>>> CPCs on a per-endpoint basis (I can't look up the trac ticket right
>>>>> now...).  This makes things a little complicated -- do we need
>>>>> btl_openib_cpc_include and btl_openib_cpc_exclude MCA params to
>>>>> include/exclude CPCs (because you might need more than one CPC in a
>>>>> single job)?  That wouldn't be hard to do.
>>>>>
>>>>> But then what to do about if someone sets to use some XRC QPs and
>>>>> selects to use OOB or RDMA CM?  How do we catch this and print an
>>>>> error?  It doesn't seem right to put the "if num_xrc_qps>0" check in
>>>>> every CPC.  What happens if you try to make an XRC QP when not using
>>>>> xoob?  Where is the error detected and what kind of error message do
>>>>> we print?
>>>>>
>>>> In my opinion "X" notation for QP specification should be removed. I
>>>> didn't want this to prevent XRC merging so I haven't raised this point.
>>>> It is enough to have two types of QPs: "P" - SW credit management, "S" -
>>>> HW credit management.
>>> How will you decide which QP type to use? (SRQ or XRC)
>>>
>> If both sides support XOOB and the priority of XOOB is higher than that
>> of all other CPCs, then create XRC; otherwise use regular RC.
>>
> What if somebody has a ConnectX HCA but wants to use SRQ and not XRC?
This will be the default (the priority of OOB will be higher than that of
XOOB), but if users want to use XRC they can increase the XOOB priority by
specifying an MCA parameter.

> I guess we will need some additional parameter anyway that allows
> enabling/disabling XRC, correct? (So why not just leave the X QP type?)
Because we want to support mixed setups and create XRC between nodes that
support it and RC between all other nodes.

--
Gleb.

Re: [OMPI devel] [PATCH] openib: clean-up connect to allow for new cm's

2007-12-12 Thread Pavel Shamis (Pasha)


Gleb Natapov wrote:

On Wed, Dec 12, 2007 at 03:37:26PM +0200, Pavel Shamis (Pasha) wrote:
  

Gleb Natapov wrote:


On Tue, Dec 11, 2007 at 08:16:07PM -0500, Jeff Squyres wrote:
  
  
Isn't there a better way somehow?  Perhaps we should have "select"  
call *all* the functions and accept back a priority.  The one with the  
highest priority then wins.  This is quite similar to much of the  
other selection logic in OMPI.


Sidenote: Keep in mind that there are some changes coming to select  
CPCs on a per-endpoint basis (I can't look up the trac ticket right  
now...).  This makes things a little complicated -- do we need  
btl_openib_cpc_include and btl_openib_cpc_exclude MCA params to  
include/exclude CPCs (because you might need more than one CPC in a  
single job)?  That wouldn't be hard to do.


But then what to do about if someone sets to use some XRC QPs and  
selects to use OOB or RDMA CM?  How do we catch this and print an  
error?  It doesn't seem right to put the "if num_xrc_qps>0" check in  
every CPC.  What happens if you try to make an XRC QP when not using  
xoob?  Where is the error detected and what kind of error message do  
we print?





In my opinion the "X" notation for QP specification should be removed. I
didn't want this to prevent the XRC merge, so I haven't raised this point.
It is enough to have two types of QPs: "P" (SW credit management) and "S"
(HW credit management).
  

How will you decide which QP type to use? (SRQ or XRC)



If both sides support XOOB and the priority of XOOB is higher than that of all other CPCs,
then create XRC; otherwise use regular RC.
  

What if somebody has a ConnectX HCA but wants to use SRQ and not XRC?
I guess we will need some additional parameter anyway that allows
enabling/disabling XRC, correct? (So why not just leave the X QP type?)

Re: [OMPI devel] [PATCH] openib: clean-up connect to allow for new cm's

2007-12-12 Thread Gleb Natapov

On Wed, Dec 12, 2007 at 03:37:26PM +0200, Pavel Shamis (Pasha) wrote:
> Gleb Natapov wrote:
> > On Tue, Dec 11, 2007 at 08:16:07PM -0500, Jeff Squyres wrote:
> >   
> >> Isn't there a better way somehow?  Perhaps we should have "select"  
> >> call *all* the functions and accept back a priority.  The one with the  
> >> highest priority then wins.  This is quite similar to much of the  
> >> other selection logic in OMPI.
> >>
> >> Sidenote: Keep in mind that there are some changes coming to select  
> >> CPCs on a per-endpoint basis (I can't look up the trac ticket right  
> >> now...).  This makes things a little complicated -- do we need  
> >> btl_openib_cpc_include and btl_openib_cpc_exclude MCA params to  
> >> include/exclude CPCs (because you might need more than one CPC in a  
> >> single job)?  That wouldn't be hard to do.
> >>
> >> But then what to do about if someone sets to use some XRC QPs and  
> >> selects to use OOB or RDMA CM?  How do we catch this and print an  
> >> error?  It doesn't seem right to put the "if num_xrc_qps>0" check in  
> >> every CPC.  What happens if you try to make an XRC QP when not using  
> >> xoob?  Where is the error detected and what kind of error message do  
> >> we print?
> >>
> >> 
> > In my opinion the "X" notation for QP specification should be removed. I
> > didn't want this to prevent the XRC merge, so I haven't raised this point.
> > It is enough to have two types of QPs: "P" (SW credit management) and "S"
> > (HW credit management).
> How will you decide which QP type to use? (SRQ or XRC)
> 
If both sides support XOOB and the priority of XOOB is higher than that of all other CPCs,
then create XRC; otherwise use regular RC.

--
Gleb.

Re: [OMPI devel] [PATCH] openib: clean-up connect to allow for new cm's

2007-12-12 Thread Pavel Shamis (Pasha)


Gleb Natapov wrote:

On Tue, Dec 11, 2007 at 08:16:07PM -0500, Jeff Squyres wrote:
  
Isn't there a better way somehow?  Perhaps we should have "select"  
call *all* the functions and accept back a priority.  The one with the  
highest priority then wins.  This is quite similar to much of the  
other selection logic in OMPI.


Sidenote: Keep in mind that there are some changes coming to select  
CPCs on a per-endpoint basis (I can't look up the trac ticket right  
now...).  This makes things a little complicated -- do we need  
btl_openib_cpc_include and btl_openib_cpc_exclude MCA params to  
include/exclude CPCs (because you might need more than one CPC in a  
single job)?  That wouldn't be hard to do.


But then what to do about if someone sets to use some XRC QPs and  
selects to use OOB or RDMA CM?  How do we catch this and print an  
error?  It doesn't seem right to put the "if num_xrc_qps>0" check in  
every CPC.  What happens if you try to make an XRC QP when not using  
xoob?  Where is the error detected and what kind of error message do  
we print?




In my opinion the "X" notation for QP specification should be removed. I
didn't want this to prevent the XRC merge, so I haven't raised this point.
It is enough to have two types of QPs: "P" (SW credit management) and "S"
(HW credit management).

How will you decide which QP type to use? (SRQ or XRC)


I think connection management should work like
this: each BTL knows which types of CPC it can use and shares
this info during the modex stage. During connection establishment, the modex info
is used to figure out the list of CPCs that both endpoints support, and the one
with the highest priority is selected.
  

ok for me.

--
Gleb.

Re: [OMPI devel] [PATCH] openib: clean-up connect to allow for new cm's

2007-12-12 Thread Gleb Natapov

On Tue, Dec 11, 2007 at 08:16:07PM -0500, Jeff Squyres wrote:
> Isn't there a better way somehow?  Perhaps we should have "select"  
> call *all* the functions and accept back a priority.  The one with the  
> highest priority then wins.  This is quite similar to much of the  
> other selection logic in OMPI.
> 
> Sidenote: Keep in mind that there are some changes coming to select  
> CPCs on a per-endpoint basis (I can't look up the trac ticket right  
> now...).  This makes things a little complicated -- do we need  
> btl_openib_cpc_include and btl_openib_cpc_exclude MCA params to  
> include/exclude CPCs (because you might need more than one CPC in a  
> single job)?  That wouldn't be hard to do.
> 
> But then what to do about if someone sets to use some XRC QPs and  
> selects to use OOB or RDMA CM?  How do we catch this and print an  
> error?  It doesn't seem right to put the "if num_xrc_qps>0" check in  
> every CPC.  What happens if you try to make an XRC QP when not using  
> xoob?  Where is the error detected and what kind of error message do  
> we print?
> 
In my opinion the "X" notation for QP specification should be removed. I
didn't want this to prevent the XRC merge, so I haven't raised this point.
It is enough to have two types of QPs: "P" (SW credit management) and "S"
(HW credit management). I think connection management should work like
this: each BTL knows which types of CPC it can use and shares
this info during the modex stage. During connection establishment, the modex info
is used to figure out the list of CPCs that both endpoints support, and the one
with the highest priority is selected.
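
A minimal sketch of the selection step described above, assuming each peer
has already exchanged a list of (CPC name, priority) pairs. The type
cpc_info_t and the function select_common_cpc are hypothetical names, not
part of the openib BTL. The thread does not say how two peers with different
priorities agree, so this sketch assumes a symmetric rule: the summed priority
decides and the name breaks ties, so both sides reach the same answer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record for one CPC as a peer would publish it in the modex. */
typedef struct {
    const char *name;   /* e.g. "oob", "xoob", "rdma_cm" */
    int priority;       /* higher is preferred */
} cpc_info_t;

/*
 * Return the name of the best CPC that both peers support, or NULL if the
 * two lists have nothing in common.  The score is the sum of both peers'
 * priorities (an assumption), with ties broken by name order so that the
 * result does not depend on which side runs the function.
 */
const char *select_common_cpc(const cpc_info_t *local, size_t nlocal,
                              const cpc_info_t *remote, size_t nremote)
{
    const char *best = NULL;
    int best_score = 0;

    for (size_t i = 0; i < nlocal; ++i) {
        for (size_t j = 0; j < nremote; ++j) {
            if (strcmp(local[i].name, remote[j].name) != 0) {
                continue;               /* not supported by both sides */
            }
            int score = local[i].priority + remote[j].priority;
            if (best == NULL || score > best_score ||
                (score == best_score && strcmp(local[i].name, best) < 0)) {
                best = local[i].name;
                best_score = score;
            }
        }
    }
    return best;
}
```

With this rule, a node that offers oob and xoob talking to a node that offers
only oob and rdma_cm would end up on oob, which is the mixed-setup behavior
discussed in this thread.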

--
Gleb.

Re: [OMPI devel] [PATCH] openib: clean-up connect to allow for new cm's

2007-12-12 Thread Jeff Squyres


On Dec 12, 2007, at 5:13 AM, Pavel Shamis (Pasha) wrote:


Hmm.  I don't think that we want to put knowledge of XRC in the OOB
CPC (and vice versa).  That seems like an abstraction violation.

I didn't like that XRC knowledge was put in the connect base either,
but I was too busy to argue with it.  :-)

Isn't there a better way somehow?  Perhaps we should have "select"
call *all* the functions and accept back a priority.  The one with the
highest priority then wins.  This is quite similar to much of the
other selection logic in OMPI.

Sidenote: Keep in mind that there are some changes coming to select
CPCs on a per-endpoint basis (I can't look up the trac ticket right
now...).  This makes things a little complicated -- do we need
btl_openib_cpc_include and btl_openib_cpc_exclude MCA params to
include/exclude CPCs (because you might need more than one CPC in a
single job)?  That wouldn't be hard to do.

But then what to do about if someone sets to use some XRC QPs and
selects to use OOB or RDMA CM?
An error message will be reported, stating that to use XRC you _must_
select xoob.


I understand that that is what it does today; I was asking my
somewhat-rhetorical question with the above text in mind (that we remove the
abstraction violations -- remove knowledge of XRC from the OOB CPC,
etc.).



How do we catch this and print an
error?  It doesn't seem right to put the "if num_xrc_qps>0" check in
every CPC.  What happens if you try to make an XRC QP when not using
xoob?



Where is the error detected and what kind of error message do
we print?


I would like to remind you of 2 things:
1. XRC changes the QP logic a little. We create one XRC QP for send and
one for recv. As a result,
it requires a completely different oob mechanism.
2. The current implementation doesn't allow running with XRC + RC (or SRQ),
and I don't think that we need such mixed-configuration
support at all.

So as a result, XRC may work only with XOOB. If you try to
run it with oob, an error message will be reported.
Likewise, if you try to run !(XRC) with XOOB, an error message will be
reported too.

The check is located in ompi_btl_openib_connect_base_open.


I understand all of that.  I think the question is if there is a way  
to de-centralize these checks such that the XOOB CPC can be the one  
that figures this stuff out (for example) rather than having to put  
this in the base.


The original code in the function used oob as the default connection
method. I changed it to check which mode we are running in
(XRC enabled/disabled) and to make xoob
the default connection manager for XRC mode
and oob the default for non-XRC mode.


Right -- this is problematic for adding IBCM and RDMA CM; that's Jon's  
point.



I am not sure that the oob CPC is the best place for the logic.
Also, I don't think that the "select + priority" solution will resolve
the dependency problem:
XRC enabled -> xoob
XRC disabled -> oob, cm.

We may push the logic outside of the CPCs and pass to
ompi_btl_openib_connect_base_open()
which connection manager we want to use. I guess that this change will also
be useful for the future "CPCs on a per-endpoint basis" changes.


From an abstraction point of view, it would be nice to get all this  
CPC-specific information out of the base and into the CPCs that they  
belong to.
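
One way to picture this, as a rough sketch only: each CPC exposes a query
function that decides for itself whether it can run and, if so, returns a
priority, and the base just iterates. Everything here (cpc_config_t, cpc_t,
the query functions, cpc_base_select, and the priority values 50 and 60) is
hypothetical and not the actual openib code; only the rule that XRC works
solely with xoob, and non-XRC does not work with xoob, comes from this thread.

```c
#include <stddef.h>

/* Hypothetical run-time configuration that every CPC may inspect. */
typedef struct {
    int num_xrc_qps;    /* number of XRC QPs requested by the user */
} cpc_config_t;

/* Hypothetical CPC descriptor: query returns 0 and sets *priority if usable. */
typedef struct {
    const char *name;
    int (*query)(const cpc_config_t *cfg, int *priority);
} cpc_t;

/* The oob CPC only needs to know that it cannot handle XRC QPs. */
static int oob_query(const cpc_config_t *cfg, int *priority)
{
    if (cfg->num_xrc_qps > 0) {
        return -1;
    }
    *priority = 50;     /* illustrative value */
    return 0;
}

/* The xoob CPC keeps the XRC-specific check inside itself. */
static int xoob_query(const cpc_config_t *cfg, int *priority)
{
    if (cfg->num_xrc_qps == 0) {
        return -1;
    }
    *priority = 60;     /* illustrative value */
    return 0;
}

static const cpc_t all_cpcs[] = {
    { "oob",  oob_query  },
    { "xoob", xoob_query },
};

/* The base holds no XRC knowledge: it asks every CPC and keeps the best. */
const cpc_t *cpc_base_select(const cpc_config_t *cfg)
{
    const cpc_t *best = NULL;
    int best_priority = 0;

    for (size_t i = 0; i < sizeof(all_cpcs) / sizeof(all_cpcs[0]); ++i) {
        int priority = 0;
        if (all_cpcs[i].query(cfg, &priority) != 0) {
            continue;   /* this CPC declined to run */
        }
        if (best == NULL || priority > best_priority) {
            best = &all_cpcs[i];
            best_priority = priority;
        }
    }
    return best;        /* NULL means no CPC is usable */
}
```

Adding another CPC such as rdma_cm would then mean adding one entry to the
table, without touching the selection loop.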



Also, I'm not sure why the #if/#else is there for xoob (i.e., having
empty/printf functions there when XRC support is compiled out) -- if
xoob was disabled during compilation, then it simply should not be
compiled and therefore not be there at all at run-time.  If a user
selects the xoob CPC, then we should print a message from the base
that that CPC doesn't exist in the installation.  Correspondingly, we
can make an info MCA param in the btl openib that shows which CPCs are
available (we already have this information -- it's easy enough to put
this in an info MCA param).


Sounds reasonable for me.

Pasha.


On Dec 11, 2007, at 6:59 PM, Jon Mason wrote:



Currently, alternate CMs cannot be called because
ompi_btl_openib_connect_base_open forces a choice of either oob or xoob
(and goes into an erroneous error path if you pick something else).
This patch reorganizes ompi_btl_openib_connect_base_open so that new
functions can easily be added.  New Open functions were added to oob
and xoob for the error handling.

I tested calling oob, xoob, and rdma_cm.  oob happily allows connections
to be established and throws no errors.  xoob fails because ompi does
not have it compiled in (and I have no connectx cards).  rdma_cm calls
the empty hooks and exits without connecting (thus throwing
non-connection errors).  All expected behavior.

Since this patch fixes the existing behavior, and is not necessarily
tied to my implementing of rdma_cm, I think it is acceptable to go in
now.

Thanks,
Jon

Index: ompi/mca/btl/openib/connect/btl_openib_connect_base.c
===
--- ompi/mca/btl/openib/connect/btl_openib_connect_ba

Re: [OMPI devel] [PATCH] openib: clean-up connect to allow for new cm's

2007-12-12 Thread Pavel Shamis (Pasha)


Jeff Squyres wrote:
Hmm.  I don't think that we want to put knowledge of XRC in the OOB  
CPC (and vice versa).  That seems like an abstraction violation.


I didn't like that XRC knowledge was put in the connect base either,  
but I was too busy to argue with it.  :-)


Isn't there a better way somehow?  Perhaps we should have "select"  
call *all* the functions and accept back a priority.  The one with the  
highest priority then wins.  This is quite similar to much of the  
other selection logic in OMPI.


Sidenote: Keep in mind that there are some changes coming to select  
CPCs on a per-endpoint basis (I can't look up the trac ticket right  
now...).  This makes things a little complicated -- do we need  
btl_openib_cpc_include and btl_openib_cpc_exclude MCA params to  
include/exclude CPCs (because you might need more than one CPC in a  
single job)?  That wouldn't be hard to do.


But then what to do about if someone sets to use some XRC QPs and  
selects to use OOB or RDMA CM?  

An error message will be reported, stating that to use XRC you _must_ select xoob.
How do we catch this and print an  
error?  It doesn't seem right to put the "if num_xrc_qps>0" check in  
every CPC.  What happens if you try to make an XRC QP when not using  
xoob?  


Where is the error detected and what kind of error message do  
we print?
  

I would like to remind you of 2 things:
1. XRC changes the QP logic a little. We create one XRC QP for send and
one for recv. As a result,
it requires a completely different oob mechanism.
2. The current implementation doesn't allow running with XRC + RC (or SRQ),
and I don't think that we need such mixed-configuration
support at all.

So as a result, XRC may work only with XOOB. If you try to
run it with oob, an error message will be reported.
Likewise, if you try to run !(XRC) with XOOB, an error message will be
reported too.


The check is located in ompi_btl_openib_connect_base_open.

The original code in the function used oob as the default connection method.
I changed it to check which mode we are running in
(XRC enabled/disabled) and to make xoob
the default connection manager for XRC mode
and oob the default for non-XRC mode.

I am not sure that the oob CPC is the best place for the logic.
Also, I don't think that the "select + priority" solution will resolve
the dependency problem:

XRC enabled -> xoob
XRC disabled -> oob, cm.

We may push the logic outside of the CPCs and pass to
ompi_btl_openib_connect_base_open()
which connection manager we want to use. I guess that this change will also
be useful for the future "CPCs on a per-endpoint basis" changes.


Also, I'm not sure why the #if/#else is there for xoob (i.e., having  
empty/printf functions there when XRC support is compiled out) -- if  
xoob was disabled during compilation, then it simply should not be  
compiled and therefore not be there at all at run-time.  If a user  
selects the xoob CPC, then we should print a message from the base  
that that CPC doesn't exist in the installation.  Correspondingly, we  
can make an info MCA param in the btl openib that shows which CPCs are  
available (we already have this information -- it's easy enough to put  
this in an info MCA param).
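
To make the suggestion above concrete, here is a small sketch, assuming the
set of compiled-in CPCs is a static table. The names available_cpcs,
cpc_is_available, cpc_list_available, and the macro EXAMPLE_HAVE_XRC are
invented for illustration; the real code would register the resulting string
through the MCA parameter interface, which is not shown here.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* CPCs compiled into this (hypothetical) build.  xoob is present only when
 * XRC support was enabled at build time, so there are no stub functions. */
static const char *const available_cpcs[] = {
    "oob",
#ifdef EXAMPLE_HAVE_XRC     /* hypothetical configure-time macro */
    "xoob",
#endif
};

static const size_t num_available_cpcs =
    sizeof(available_cpcs) / sizeof(available_cpcs[0]);

/* Return 1 if the named CPC exists in this build, 0 otherwise.  The base
 * could call this before printing a "CPC not available" message. */
int cpc_is_available(const char *name)
{
    for (size_t i = 0; i < num_available_cpcs; ++i) {
        if (strcmp(available_cpcs[i], name) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Write a comma-separated list of available CPCs into buf, suitable as the
 * value of an informational parameter.  Returns 0 on success, or -1 if the
 * buffer is too small. */
int cpc_list_available(char *buf, size_t len)
{
    size_t used = 0;

    if (len == 0) {
        return -1;
    }
    buf[0] = '\0';
    for (size_t i = 0; i < num_available_cpcs; ++i) {
        int n = snprintf(buf + used, len - used, "%s%s",
                         i > 0 ? "," : "", available_cpcs[i]);
        if (n < 0 || (size_t)n >= len - used) {
            return -1;
        }
        used += (size_t)n;
    }
    return 0;
}
```

In a build without XRC support, asking for xoob would simply fail the
availability check in the base rather than reaching an empty xoob function.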
  

Sounds reasonable for me.

Pasha.


On Dec 11, 2007, at 6:59 PM, Jon Mason wrote:

  

Currently, alternate CMs cannot be called because
ompi_btl_openib_connect_base_open forces a choice of either oob or xoob
(and goes into an erroneous error path if you pick something else).
This patch reorganizes ompi_btl_openib_connect_base_open so that new
functions can easily be added.  New Open functions were added to oob
and xoob for the error handling.

I tested calling oob, xoob, and rdma_cm.  oob happily allows connections
to be established and throws no errors.  xoob fails because ompi does
not have it compiled in (and I have no connectx cards).  rdma_cm calls
the empty hooks and exits without connecting (thus throwing
non-connection errors).  All expected behavior.

Since this patch fixes the existing behavior, and is not necessarily
tied to my implementing of rdma_cm, I think it is acceptable to go in
now.

Thanks,
Jon

Index: ompi/mca/btl/openib/connect/btl_openib_connect_base.c
===
--- ompi/mca/btl/openib/connect/btl_openib_connect_base.c	(revision 16937)
+++ ompi/mca/btl/openib/connect/btl_openib_connect_base.c	(working copy)

@@ -50,8 +50,8 @@
 */
int ompi_btl_openib_connect_base_open(void)
{
-int i;
-char **temp, *a, *b;
+char **temp, *a, *b, *defval;
+int i, ret = OMPI_ERROR;

/* Make an MCA parameter to select which connect module to use */
temp = NULL;
@@ -66,40 +66,23 @@

/* For XRC qps we must to use XOOB connection manager */
if (mca_btl_openib_component.num_xrc_qps > 0) {
-mca_base_param_reg_string(&mca_btl_openib_component.super.btl_version,
-"connect",
-b, false, false,
-"xoob", &param);
-
