[OMPI devel] SM BTL NUMA awareness patches

2008-05-28 Thread Gleb Natapov
Hi,

The two attached patches implement NUMA awareness in the SM BTL. The first
one adds two new functions to the maffinity framework that are required by
the second patch. The functions are:

 opal_maffinity_base_node_name_to_id() - takes a string that represents a
 memory node name and translates it to a memory node id.
 opal_maffinity_base_bind() - binds an address range to a specific
 memory node.

The bind() function cannot be implemented by all maffinity components
(there is no way the first_use maffinity component can implement such
functionality). In that case the function pointer can be set to NULL.

The second one adds NUMA awareness support to the SM BTL and SM MPOOL. Each
process determines which CPU it is running on and exchanges this info with
the other local processes. Each process creates a separate MPOOL for every
available memory node and uses them to allocate memory on specific memory
nodes when needed. For instance, circular buffer memory is always allocated
on the memory node local to the receiver process.
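
A rough sketch of how the new wrappers are intended to be used for this
(the segment field names and the helper below are illustrative assumptions,
not the actual patch code):

#include <stddef.h>
#include "opal/constants.h"
#include "opal/mca/maffinity/base/base.h"

/* Hedged sketch: bind an already-mapped circular buffer to the memory
 * node local to the receiver.  Segment field names are assumed. */
static int bind_cb_to_receiver_node(void *cb_addr, size_t cb_len,
                                    char *receiver_node_name)
{
    opal_maffinity_base_segment_t seg;
    int node_id, rc;

    /* Translate the memory node name (from the carto topology) to an id */
    rc = opal_maffinity_base_node_name_to_id(receiver_node_name, &node_id);
    if (OPAL_SUCCESS != rc) {
        return rc;  /* e.g. first_use component: no binding available */
    }

    seg.mbs_start_addr = cb_addr;   /* assumed field name */
    seg.mbs_len = cb_len;           /* assumed field name */

    /* Components that cannot bind return OPAL_ERR_NOT_IMPLEMENTED */
    return opal_maffinity_base_bind(&seg, 1, node_id);
}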

To use this on a Linux machine, a carto file with a HW topology description
should be provided. Processes should be bound to specific CPUs (by
specifying a rank file, for instance) and the session directory should be
created on a tmpfs file system (otherwise Linux ignores memory binding
commands) by setting the orte_tmpdir_base parameter to point to a tmpfs
mount point.

Questions and suggestions are always welcome.

--
Gleb.
commit 883db5e1ce8c3b49cc1376e6acf9c2d5d0d77983
Author: Gleb Natapov 
Date:   Tue May 27 14:55:11 2008 +0300

Add functions to maffinity.

diff --git a/opal/mca/maffinity/base/base.h b/opal/mca/maffinity/base/base.h
index c44efed..339e6a1 100644
--- a/opal/mca/maffinity/base/base.h
+++ b/opal/mca/maffinity/base/base.h
@@ -105,6 +105,9 @@ OPAL_DECLSPEC int opal_maffinity_base_select(void);
  */
 OPAL_DECLSPEC int opal_maffinity_base_set(opal_maffinity_base_segment_t *segments, size_t num_segments);

+OPAL_DECLSPEC int opal_maffinity_base_node_name_to_id(char *, int *);
+OPAL_DECLSPEC int opal_maffinity_base_bind(opal_maffinity_base_segment_t *, size_t, int);
+
 /**
  * Shut down the maffinity MCA framework.
  *
diff --git a/opal/mca/maffinity/base/maffinity_base_wrappers.c b/opal/mca/maffinity/base/maffinity_base_wrappers.c
index ec843eb..eef5c7d 100644
--- a/opal/mca/maffinity/base/maffinity_base_wrappers.c
+++ b/opal/mca/maffinity/base/maffinity_base_wrappers.c
@@ -31,3 +31,33 @@ int opal_maffinity_base_set(opal_maffinity_base_segment_t *segments,
 }
 return opal_maffinity_base_module->maff_module_set(segments, num_segments);
 }
+
+int opal_maffinity_base_node_name_to_id(char *node_name, int *node_id)
+{
+if (!opal_maffinity_base_selected) {
+return OPAL_ERR_NOT_FOUND;
+}
+
+if (!opal_maffinity_base_module->maff_module_name_to_id) {
+*node_id = 0;
+return OPAL_ERR_NOT_IMPLEMENTED;
+}
+
+return opal_maffinity_base_module->maff_module_name_to_id(node_name,
+node_id);
+}
+
+int opal_maffinity_base_bind(opal_maffinity_base_segment_t *segments,
+size_t num_segments, int node_id)
+{
+if (!opal_maffinity_base_selected) {
+return OPAL_ERR_NOT_FOUND;
+}
+
+if (!opal_maffinity_base_module->maff_module_bind) {
+return OPAL_ERR_NOT_IMPLEMENTED;
+}
+
+return opal_maffinity_base_module->maff_module_bind(segments, num_segments,
+node_id);
+}
diff --git a/opal/mca/maffinity/first_use/maffinity_first_use_module.c b/opal/mca/maffinity/first_use/maffinity_first_use_module.c
index a68c2a9..0ae33e1 100644
--- a/opal/mca/maffinity/first_use/maffinity_first_use_module.c
+++ b/opal/mca/maffinity/first_use/maffinity_first_use_module.c
@@ -41,7 +41,9 @@ static const opal_maffinity_base_module_1_0_0_t loc_module = {
 first_use_module_init,

 /* Module function pointers */
-first_use_module_set
+first_use_module_set,
+NULL,
+NULL
 };

 int opal_maffinity_first_use_component_query(mca_base_module_t **module, int *priority)
diff --git a/opal/mca/maffinity/libnuma/maffinity_libnuma_module.c b/opal/mca/maffinity/libnuma/maffinity_libnuma_module.c
index 1fc2231..b2b109c 100644
--- a/opal/mca/maffinity/libnuma/maffinity_libnuma_module.c
+++ b/opal/mca/maffinity/libnuma/maffinity_libnuma_module.c
@@ -20,6 +20,7 @@

 #include 
 #include 
+#include 

 #include "opal/constants.h"
 #include "opal/mca/maffinity/maffinity.h"
@@ -33,6 +34,8 @@
 static int libnuma_module_init(void);
 static int libnuma_module_set(opal_maffinity_base_segment_t *segments,
   size_t num_segments);
+static int libnuma_module_node_name_to_id(char *, int *);
+static int libnuma_modules_bind(opal_maffinity_base_segment_t *, size_t, int);

 /*
  * Libnuma maffinity module
@@ -42,7 +45,9 @@ static const

Re: [OMPI devel] Open MPI session directory location

2008-05-27 Thread Gleb Natapov
On Tue, May 27, 2008 at 08:27:49AM -0600, Ralph H Castain wrote:
> -mca orte_tmpdir_base foo
Thanks! It works. But this parameter is not reported by ompi_info :(

> 
> 
> 
> On 5/27/08 8:24 AM, "Gleb Natapov"  wrote:
> 
> > Hi,
> > 
> >   Is there a way to change where Open MPI creates session directory. I
> > can't find mca parameter that specifies this.
> > 
> > --
> > Gleb.
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


[OMPI devel] Open MPI session directory location

2008-05-27 Thread Gleb Natapov
Hi,

  Is there a way to change where Open MPI creates the session directory? I
can't find the MCA parameter that specifies this.

--
Gleb.


Re: [OMPI devel] Memory hooks stuff

2008-05-26 Thread Gleb Natapov
On Sun, May 25, 2008 at 10:54:23AM -0400, Patrick Geoffray wrote:
> Jeff Squyres wrote:
> > That would also be great.  I don't know anything about these mmu  
> > notifiers (I'm not much of a kernel guy), but anything that allows us  
> 
> It's what Quadrics used for years in True64. Instead of trying to catch 
> at user-level all instances when the page table of a process is modified 
> (free, munmap, sbrk, etc...), the kernel notifies you when that happens.
Not just that, but also when swapping out or a page fault happens, so no
page pinning is needed at all. But the HW has to be designed to work with
changing page mappings, and I am not sure that Mellanox HW is designed for
that. What about Myricom HW?

--
Gleb.


Re: [OMPI devel] Memory hooks stuff

2008-05-23 Thread Gleb Natapov
On Fri, May 23, 2008 at 07:19:01AM -0400, Jeff Squyres wrote:
> Brian and I were chatting the other day about random OMPI stuff and  
> the topic of the memory hooks came up again.  Brian was wondering if  
> we should [finally] revisit this topic -- there's a few things that  
> could be done to make life "better".  Two things jump to mind:
> 
> - using mallopt on Linux
> - doing *something* on Solaris
> 
> It would probably be worthwhile to have a teleconf about this in the  
> near future for anyone who is interested.  I propose any time before  
> 4pm US Eastern on Wednesday, 28 May, 2008.
> 
> Who would be interested in discussing this stuff?  (me, Brian, ? 
> someone from Sun?, ...?)
> 
Me.

--
Gleb.


Re: [OMPI devel] RFC: Linuxes shipping libibverbs

2008-05-23 Thread Gleb Natapov
On Thu, May 22, 2008 at 08:30:52PM +, Dirk Eddelbuettel wrote:
> > > Also, if this test depends on the Debian kernel packages, then we're
> > > back to square one as some folks (like myself) run binary kernels,
> > > other may just hand-compile and this test may not work as we may miss
> > > the 'Debian trigger' in those cases.
> > 
> > 
> > The OpenFabrics kernel drivers are implemented as kernel modules, so  
> > it's mainly just a question of loading them it to start them running.   
> > For example, in the official OFED distribution, it comes with /etc/ 
> 
> Do you have any information whether OFED is in fact packaged for
> Debian?  It may not be, and hence no file ...
> 
AFAIK OFED is not packaged for Debian. Ronald packages IB for Debian.

--
Gleb.


Re: [OMPI devel] RFC: Linuxes shipping libibverbs

2008-05-23 Thread Gleb Natapov
On Thu, May 22, 2008 at 04:19:05PM -0400, Jeff Squyres wrote:
> On May 22, 2008, at 4:07 PM, Dirk Eddelbuettel wrote:
> 
> > Is there a test I could run for you?
> 
> Can you see if /dev/infiniband exists?  If it does, the OpenFabrics  
> kernel drivers are running.  If not, they aren't.
Either that, or udev is not configured properly.

> 
> > Also, if this test depends on the Debian kernel packages, then we're
> > back to square one as some folks (like myself) run binary kernels,
> > other may just hand-compile and this test may not work as we may miss
> > the 'Debian trigger' in those cases.
> 
> 
> The OpenFabrics kernel drivers are implemented as kernel modules, so  
> it's mainly just a question of loading them it to start them running.   
> For example, in the official OFED distribution, it comes with /etc/ 
> init.d/openibd -- "start" loads the kernel modules and does all the  
> necessary initialization, "stop" unloads everything, etc.
> 
ib_core/mthca/mlx4 should be loaded automatically by hotplug if HW is
present. No need for any additional configuration.

--
Gleb.


Re: [OMPI devel] Threaded progress for CPCs

2008-05-20 Thread Gleb Natapov
On Mon, May 19, 2008 at 01:38:53PM -0400, Jeff Squyres wrote:
> >> 5. ...?
> > What about moving posting of receive buffers into main thread. With
> > SRQ it is easy: don't post anything in CPC thread. Main thread will
> > prepost buffers automatically after first fragment received on the
> > endpoint (in btl_openib_handle_incoming()). With PPRQ it's more
> > complicated. What if we'll prepost dummy buffers (not from free list)
> > during IBCM connection stage and will run another three way handshake
> > protocol using those buffers, but from the main thread. We will need  
> > to
> > prepost one buffer on the active side and two buffers on the passive  
> > side.
> 
> 
> This is probably the most viable alternative -- it would be easiest if  
> we did this for all CPC's, not just for IBCM:
> 
> - for PPRQ: CPCs only post a small number of receive buffers, suitable  
> for another handshake that will run in the upper-level openib BTL
> - for SRQ: CPCs don't post anything (because the SRQ already "belongs"  
> to the upper level openib BTL)
> 
> Do we have a BSRQ restriction that there *must* be at least one PPRQ?   
No, we don't have such a restriction and I wouldn't want to add one.

> If so, we could always run the upper-level openib BTL really-post-the- 
> buffers handshake over the smallest buffer size BSRQ RC PPRQ (i.e.,  
> have the CPC post a single receive on this QP -- see below), which  
> would make things much easier.  If we don't already have this  
> restriction, would we mind adding it?  We have one PPRQ in our default  
> receive_queues value, anyway.
If there is no PPRQ then we can rely on the RNR/retransmit logic in case
there are not enough buffers in the SRQ. We do that anyway in the openib
BTL code.

> 
> With this rationale, once the CPC says "ok, all BSRQ QP's are  
> connected", then _endpoint.c can run a CTS handshake to post the  
> "real" buffers, where each side does the following:
> 
> - CPC calls _endpoint_connected() to tell the upper level BTL that it  
> is fully connected (the function is invoked in the main thread)
> - _endpoint_connected() posts all the "real" buffers to all the BSRQ  
> QP's on the endpoint
> - _endpoint_connected() then sends a CTS control message to remote  
> peer via smallest RC PPRQ
> - upon receipt of CTS:
>- release the buffer (***)
>- set endpoint state of CONNECTED and let all pending messages  
> flow... (as it happens today)
> 
> So it actually doesn't even have to be a handshake -- it's just an  
> additional CTS sent over the newly-created RC QP.  Since it's RC, we  
> don't have to do much -- just wait for the CTS to know that the remote  
> side has actually posted all the receives that we expect it to have.   
> Since the CTS flows over a PPRQ, there's no issue about receiving the  
> CTS on an SRQ (because the SRQ may not have any buffers posted at any  
> given time).
Correct. A full handshake is not needed. The trick is to allocate those
initial buffers in a smart way. IMO the initial buffers should be very
small (a couple of bytes only) and be preallocated on endpoint creation.
This will solve the locking problem.
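
A minimal sketch of what such preallocation could look like with plain
libibverbs (the buffer size, struct and function names are illustrative
assumptions, not the openib BTL code):

#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

#define CTS_BUF_SIZE 8            /* "a couple of bytes" for the CTS token */

struct cts_buffer {               /* hypothetical per-endpoint bookkeeping */
    void          *addr;
    struct ibv_mr *mr;
};

/* Register the tiny buffer once, at endpoint creation, so the CPC thread
 * never needs to touch the free lists, mpool or SRQ. */
static int cts_prealloc(struct ibv_pd *pd, struct cts_buffer *cts)
{
    cts->addr = malloc(CTS_BUF_SIZE);
    if (NULL == cts->addr) return -1;
    cts->mr = ibv_reg_mr(pd, cts->addr, CTS_BUF_SIZE, IBV_ACCESS_LOCAL_WRITE);
    if (NULL == cts->mr) { free(cts->addr); return -1; }
    return 0;
}

/* Post the preallocated buffer as a receive; it can be done from the main
 * thread without any locking against the BTL data structures. */
static int cts_post_recv(struct ibv_qp *qp, struct cts_buffer *cts)
{
    struct ibv_sge sge = { .addr = (uintptr_t) cts->addr,
                           .length = CTS_BUF_SIZE, .lkey = cts->mr->lkey };
    struct ibv_recv_wr wr = { .wr_id = (uintptr_t) cts,
                              .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad_wr;
    return ibv_post_recv(qp, &wr, &bad_wr);
}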

--
Gleb.


Re: [OMPI devel] Threaded progress for CPCs

2008-05-19 Thread Gleb Natapov
On Mon, May 19, 2008 at 01:52:22PM -0500, Jon Mason wrote:
> On Mon, May 19, 2008 at 05:17:57PM +0300, Gleb Natapov wrote:
> > On Mon, May 19, 2008 at 05:08:17PM +0300, Pavel Shamis (Pasha) wrote:
> > > >> 5. ...?
> > > >> 
> > > > What about moving posting of receive buffers into main thread. With
> > > > SRQ it is easy: don't post anything in CPC thread. Main thread will
> > > > prepost buffers automatically after first fragment received on the
> > > > endpoint (in btl_openib_handle_incoming()). 
> > > It still doesn't guaranty that we will not see RNR (as I understand we 
> > > trying to resolve this problem  for iwarp?!)
> > > 
> > I don't think that iwarp has SRQ at all. And if it has then it should
> 
> While Chelsio does not currently have an adapter that has SRQs, there are
> some other iWARP vendors that do have them.
> 
> > have HW flow control for it too. I don't see what advantage SRQ without
> > flow control can provide over PPRQ.
> 
> Technically, this is not flow control, it is a retransmit.  iWARP can use
> the HW TCP stack to retransmit, but it will not have the "retransmit
> forever" ability that setting rnr_retry to 7 has for IB.
For how long will it try to retransmit before dropping the connection?

> 
> > > So this solution will cost 1 buffer on each srq ... sounds acceptable 
> > > for me. But I don't see too much
> > > difference compared to #1, as I understand  we anyway will be need the 
> > > pipe for communication with main thread.
> > > so why don't use #1 ?
> > What communication? No communication at all. Just don't prepost buffers
> > to SRQ during connection establishment. Problem solved (only for SRQ of
> > cause).
> 
> iWARP needs preposted recv buffers (or it will drop the connection).  So
> this isn't a good option.
I was talking about SRQ only. You said above that iWARP does retransmit for SRQ.
The openib BTL relies on HW retransmit when using SRQ, so if iWARP doesn't do it
reliably enough it cannot be used with SRQ anyway.

--
Gleb.


Re: [OMPI devel] Threaded progress for CPCs

2008-05-19 Thread Gleb Natapov
On Mon, May 19, 2008 at 07:39:13PM +0300, Pavel Shamis (Pasha) wrote:
 So this solution will cost 1 buffer on each srq ... sounds  
 acceptable for me. But I don't see too much
 difference compared to #1, as I understand  we anyway will be need  
 the pipe for communication with main thread.
 so why don't use #1 ?
 
>>> What communication? No communication at all. Just don't prepost buffers
>>> to SRQ during connection establishment. Problem solved (only for SRQ of
>>> cause).  
> As i know Jeff use the pipe for some status update (Jeff, please correct  
> me if  I wrong).
> If we still need pipe for communication , I prefer #1.
> If we don't have the pipe , I prefer your solution
>
The pipe will still be there. The pipe itself is not the problem. The
problem is that currently the initial post_receives are done in the CPC
thread. post_receives involves access to some data structures that are
used in the main thread too (free lists, mpool, SRQ), so it has to be
either protected or eliminated. I think that eliminating it is the better
solution for now. For the SRQ case it is also easy to do. PPRQ is more
complicated, but IMHO possible.
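
A minimal sketch of the "eliminate" option for the SRQ case (the type, flag
and helper names below are hypothetical; only the idea of doing the prepost
lazily from the main thread is the point):

#include <stdbool.h>

struct openib_btl_state {             /* hypothetical stand-in type */
    bool srq_buffers_preposted;
};

static void prepost_srq_buffers(struct openib_btl_state *btl)
{
    /* would draw fragments from the free list / mpool and post them to
     * the SRQ with ibv_post_srq_recv(); omitted here */
    (void) btl;
}

/* Called from the main thread only (btl_openib_handle_incoming() in the
 * real code), so the shared structures need no extra protection. */
static void handle_incoming(struct openib_btl_state *btl)
{
    if (!btl->srq_buffers_preposted) {
        prepost_srq_buffers(btl);
        btl->srq_buffers_preposted = true;
    }
    /* ... normal fragment processing ... */
}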

--
Gleb.


Re: [OMPI devel] Threaded progress for CPCs

2008-05-19 Thread Gleb Natapov
On Mon, May 19, 2008 at 05:08:17PM +0300, Pavel Shamis (Pasha) wrote:
> >> 5. ...?
> >> 
> > What about moving posting of receive buffers into main thread. With
> > SRQ it is easy: don't post anything in CPC thread. Main thread will
> > prepost buffers automatically after first fragment received on the
> > endpoint (in btl_openib_handle_incoming()). 
> It still doesn't guaranty that we will not see RNR (as I understand we 
> trying to resolve this problem  for iwarp?!)
> 
I don't think that iWARP has SRQ at all. And if it does, then it should
have HW flow control for it too. I don't see what advantage an SRQ without
flow control can provide over PPRQ.

> So this solution will cost 1 buffer on each srq ... sounds acceptable 
> for me. But I don't see too much
> difference compared to #1, as I understand  we anyway will be need the 
> pipe for communication with main thread.
> so why don't use #1 ?
What communication? No communication at all. Just don't prepost buffers
to the SRQ during connection establishment. Problem solved (only for SRQ,
of course).

--
Gleb.


Re: [OMPI devel] Threaded progress for CPCs

2008-05-19 Thread Gleb Natapov
On Sun, May 18, 2008 at 11:38:36AM -0400, Jeff Squyres wrote:
> ==> Remember that the goal for this work was to have a separate  
> progress thread *without* all the heavyweight OMPI thread locks.   
> Specifically: make it work in a build without --enable-progress- 
> threads or --enable-mpi-threads (we did some preliminary testing with  
> that stuff enabled and it had a big performance impact).
> 
> 1. When CM progress thread completes an incoming connection, it sends  
> a command down a pipe to the main thread indicating that a new  
> endpoint is ready to use.  The pipe message will be noticed by  
> opal_progress() in the main thread and will run a function to do all  
> necessary housekeeping (sets the endpoint state to CONNECTED, etc.).   
> But it is possible that the receiver process won't dip into the MPI  
> layer for a long time (and therefore not call opal_progress and the  
> housekeeping function).  Therefore, it is possible that with an active  
> sender and a slow receiver, the sender can overwhelm an SRQ.  On IB,  
> this will just generate RNRs and be ok (we configure SRQs to have  
> infinite RNRs), but I don't understand the semantics of what will  
> happen on iWARP (it may terminate?  I sent an off-list question to  
> Steve Wise to ask for detail -- we may have other issues with SRQ on  
> iWARP if this is the case, but let's skip that discussion for now).
> 
Is it possible to have a sane SRQ implementation without HW flow control?
Anyway, the described problem exists with SRQ right now too. If the receiver
doesn't enter progress for a long time, the sender can overwhelm an SRQ.
I don't see how this can be fixed without a progress thread (and I am not
even sure that this is a problem that has to be fixed).
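
For reference, the "infinite RNRs" mentioned above is just the verbs
rnr_retry value of 7. A hedged sketch of the relevant ibv_modify_qp() call
(generic libibverbs usage, not the openib BTL's own QP setup; the other RTS
attributes are arbitrary example values):

#include <string.h>
#include <infiniband/verbs.h>

static int qp_to_rts_with_infinite_rnr(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));

    attr.qp_state      = IBV_QPS_RTS;
    attr.timeout       = 14;    /* arbitrary example values */
    attr.retry_cnt     = 7;
    attr.rnr_retry     = 7;     /* 7 == retry forever on RNR NAK */
    attr.sq_psn        = 0;
    attr.max_rd_atomic = 1;

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}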

> Even if we can get the iWARP semantics to work, this feels kinda  
> icky.  Perhaps I'm overreacting and this isn't a problem that needs to  
> be fixed -- after all, this situation is no different than what  
> happens after the initial connection, but it still feels icky.
What is so icky about it? The sender is faster than the receiver, so flow
control kicks in.

> 
> 2. The CM progress thread posts its own receive buffers when creating  
> a QP (which is a necessary step in both CMs).  However, this is  
> problematic in two cases:
> 
[skip]

I don't like 1,2 and 3. :(

> 4. Have a separate mpool for drawing initial receive buffers for the
> CM-posted RQs.  We'd probably want this mpool to be always empty (or
> close to empty) -- it's ok to be slow to allocate / register more
> memory when a new connection request arrives.  The memory obtained
> from this mpool should be able to be returned to the "main" mpool
> after it is consumed.

This is slightly better, but still...

> 5. ...?
What about moving the posting of receive buffers into the main thread? With
SRQ it is easy: don't post anything in the CPC thread. The main thread will
prepost buffers automatically after the first fragment is received on the
endpoint (in btl_openib_handle_incoming()). With PPRQ it's more
complicated. What if we prepost dummy buffers (not from a free list)
during the IBCM connection stage and run another three-way handshake
protocol using those buffers, but from the main thread? We would need to
prepost one buffer on the active side and two buffers on the passive side.

--
Gleb.


Re: [OMPI devel] openib btl code review

2008-05-18 Thread Gleb Natapov
On Thu, May 15, 2008 at 11:58:02PM -0400, Jeff Squyres wrote:
> I updated the patch on https://svn.open-mpi.org/trac/ompi/ticket/1285  
> per Gleb's suggestions (I made a few commits tonight with some of the  
> non-receive-queues-patch-related fixes) and with some fixes for issues  
> that Nysal found.
> 
> Please see the most recent patch on the ticket.
Looks good to me.

> 
> 
> 
> On May 15, 2008, at 11:01 AM, Jeff Squyres wrote:
> 
> > On May 15, 2008, at 8:46 AM, Gleb Natapov wrote:
> >
> >>> Any other reviewers would be welcome...  :-)
> >> I'll look at it next week too.
> >
> > Thanks.
> >
> >>>> - some random style cleanup
> >>>> - fix a few minor memory leaks
> >
> > These two are the only ones that are really separate from the rest.
> >
> >>>> - adapt _ini.c to accept the "receive_queues" field in the file
> >>>> - move 90% of _setup_qps() from _ini.c to _component.c
> >>>> - move what was left of _setup_qps() into the main
> >>>> _register_mca_params() function
> >>>> - adapt init_one_hca() to detect conflicting receive_queues values
> >>>> from the INI file
> >>>> - after the _component.c loop calling init_one_hca():
> >>>> - call setup_qps() to parse the final receive_queues string value
> >>>> - traverse all resulting btls and initialize their HCAs (if they
> >>>> weren't already): setup some lists and call prepare_hca_for_use()
> >>>>
> >> It is better to have separate patch (and commit) for each of these
> >> items.
> >> Doing review and dialing with bugs is much easier this way.
> >
> >
> > I'll separate out the first two into separate fixes; I can even commit
> > those because they're pretty harmless and small.  FWIW: all of the
> > style changes were because I tried several approaches for the
> > receive_queues stuff before I found one that worked (i.e., I adapted
> > style of code that I touched, but then ended up reverting everything
> > except the style changes).
> >
> > -- 
> > Jeff Squyres
> > Cisco Systems
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> Cisco Systems
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


Re: [OMPI devel] openib btl code review

2008-05-15 Thread Gleb Natapov
On Thu, May 15, 2008 at 08:14:29AM -0400, Jeff Squyres wrote:
> Pasha tells me he'll be able to review the patch next week, so I'll  
> wait to commit until then.  I added the patch to the ticket, just so  
> that it doesn't get lost.
> 
> Any other reviewers would be welcome...  :-)
I'll look at it next week too.

> > The attached patch does the following (Jon wrote part of this, too):
> >
> > - some random style cleanup
> > - fix a few minor memory leaks
> > - adapt _ini.c to accept the "receive_queues" field in the file
> > - move 90% of _setup_qps() from _ini.c to _component.c
> > - move what was left of _setup_qps() into the main  
> > _register_mca_params() function
> > - adapt init_one_hca() to detect conflicting receive_queues values  
> > from the INI file
> > - after the _component.c loop calling init_one_hca():
> >  - call setup_qps() to parse the final receive_queues string value
> >  - traverse all resulting btls and initialize their HCAs (if they  
> > weren't already): setup some lists and call prepare_hca_for_use()
> >
It is better to have a separate patch (and commit) for each of these items.
Doing review and dealing with bugs is much easier this way.

--
Gleb.


Re: [OMPI devel] Unbelievable situation BUG

2008-04-27 Thread Gleb Natapov
On Sun, Apr 27, 2008 at 07:00:57PM +0300, Lenny Verkhovsky wrote:
> Hi, all 
> 
> I faced the "Unbelievable situation"
The situation is believable, but commit r18274, which adds this output, is
not, as it doesn't take sequence number wrap-around into account.
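
A hedged sketch of a wrap-around-aware check for the 16-bit sequence
numbers (illustrative only, not the ob1 code): with modular arithmetic, a
sequence number of 0 arriving while 65534 is expected is just "slightly
ahead", not a duplicate.

#include <stdbool.h>
#include <stdint.h>

/* True only if seq lies strictly behind the expected value, taking 16-bit
 * wrap-around into account. */
static bool seq_is_duplicate(uint16_t seq, uint16_t expected)
{
    uint16_t distance = (uint16_t)(expected - seq);
    return distance != 0 && distance < UINT16_MAX / 2;
}

/* seq_is_duplicate(0, 65534)     -> false: 0 is two steps ahead of 65534
 * seq_is_duplicate(65530, 65534) -> true:  really an old/duplicate frame */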

> 
> during running IMB benchmark.
> 
>  
> 
>  
> 
> /home/USERS/lenny/OMPI_ORTE_LMC/bin/mpirun -np 96 --bynode  -hostfile
> hostfile_ompi -mca btl_openib_max_lmc 1 ./IMB-MPI1 PingPong PingPing
> Sendrecv Exchange Allreduce Reduce Reduce_scatter Bcast Barrier
> 
>  
> 
>  
> 
>  
> 
> #
> 
> # Benchmarking Allreduce
> 
> # #processes = 96
> 
> #
> 
> #Benchmarking#procs   #bytes #repetitions  t_min[usec]
> t_max[usec]  t_avg[usec]
> 
> Allreduce   96  0 1000 0.02
> 0.03 0.02
> 
> Allreduce   96  4 1000   297.88
> 298.07   297.95
> 
> Allreduce   96  8 1000   296.15
> 296.32   296.24
> 
> Allreduce   96 16 1000   297.99
> 298.17   298.09
> 
> Allreduce   96 32 1000   296.97
> 297.20   297.04
> 
> Allreduce   96 64 1000   298.43
> 298.64   298.49
> 
> Allreduce   96128 1000   296.86
> 297.07   296.93
> 
> Allreduce   96256 1000   298.00
> 298.30   298.09
> 
> Allreduce   96512 1000   296.79
> 296.96   296.85
> 
> Allreduce   96   1024 1000   299.23
> 299.39   299.31
> 
> Allreduce   96   2048 1000   295.51
> 295.64   295.57
> 
> Allreduce   96   4096 1000   246.02
> 246.13   246.08
> 
> Allreduce   96   8192 1000   492.52
> 492.74   492.63
> 
> Allreduce   96  16384 1000  5380.59
> 5381.47  5381.10
> 
> Allreduce   96  32768 1000  5372.86
> 5373.69  5373.36
> 
> Allreduce   96  65536  640  5470.41
> 5471.88  5471.16
> 
> Allreduce   96 131072  320  5554.52
> 5556.82  .75
> 
> [witch24:15639] Unbelievable situation ... we got a duplicated fragment
> with seq number of 0 (expected 65534) from witch23
> 
> [witch24:15639] Unbelievable situation ... we got a duplicated fragment
> with seq number of 65116 (expected 65534) from witch23
> 
> [witch24:15639] *** Process received signal ***
> 
> [witch24:15639] Signal: Segmentation fault (11)
> 
> [witch24:15639] Signal code: Address not mapped (1)
> 
> [witch24:15639] Failing at address: 0x632457d0
> 
> [witch24:15639] [ 0] /lib64/libpthread.so.0 [0x2b7929a9bc10]
> 
> [witch24:15639] [ 1]
> /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_allocator_bucket.so
> [0x2b792aa47d34]
> 
> [witch24:15639] [ 2]
> /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_pml_ob1.so
> [0x2b792b172163]
> 
> [witch24:15639] [ 3]
> /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_btl_openib.so
> [0x2b792b6b0772]
> 
> [witch24:15639] [ 4]
> /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_btl_openib.so
> [0x2b792b6b15ff]
> 
> [witch24:15639] [ 5]
> /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_bml_r2.so
> [0x2b792b38307f]
> 
> [witch24:15639] [ 6]
> /home/USERS/lenny/OMPI_ORTE_LMC/lib/libopen-pal.so.0(opal_progress+0x4a)
> [0x2b79294cd16a]
> 
> [witch24:15639] [ 7] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libmpi.so.0
> [0x2b79292163a8]
> 
> [witch24:15639] [ 8]
> /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_coll_tuned.so
> [0x2b792c077cb7]
> 
> [witch24:15639] [ 9]
> /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_coll_tuned.so
> [0x2b792c07b296]
> 
> [witch24:15639] [10]
> /home/USERS/lenny/OMPI_ORTE_LMC/lib/libmpi.so.0(PMPI_Allreduce+0x1e7)
> [0x2b7929229907]
> 
> [witch24:15639] [11] ./IMB-MPI1(IMB_allreduce+0x8e) [0x40764e]
> 
> [witch24:15639] [12] ./IMB-MPI1(main+0x3aa) [0x4034ea]
> 
> [witch24:15639] [13] /lib64/libc.so.6(__libc_start_main+0xf4)
> [0x2b7929bc2154]
> 
> [witch24:15639] [14] ./IMB-MPI1 [0x4030a9]
> 
> [witch24:15639] *** End of error message ***
> 
> 
> --
> 
> Best Regards,
> 
> Lenny.
> 
>  
> 

> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


Re: [OMPI devel] Merging in the CPC work

2008-04-24 Thread Gleb Natapov
On Thu, Apr 24, 2008 at 11:50:10AM +0300, Pavel Shamis (Pasha) wrote:
> Jeff,
> All my tests fail.
> XRC disabled tests failed with:
> mtt/installs/Zq_9/install/lib/openmpi/mca_btl_openib.so: undefined 
> symbol: rdma_create_event_channel
> XRC enabled failed with segfault , I will take a look later today.
Well, it is a little bit better for me. I compiled only the OOB connection
manager and OMPI passes simple testing.

> 
> Pasha
> 
> Jeff Squyres wrote:
> > As we discussed yesterday, I have started the merge from the /tmp- 
> > public/openib-cpc2 branch.  "oob" is currently the default.
> >
> > Unfortunately, it caused quite a few conflicts when I merged with the  
> > trunk, so I created a new temp branch and put all the work there: /tmp- 
> > public/openib-cpc3.
> >
> > Could all the IB and iWARP vendors and any other interested parties  
> > please try this branch before we bring it back to the trunk?  Please  
> > test all functionality that you care about -- XRC, etc.  I'd like to  
> > bring it back to the trunk COB Thursday.  Please let me know if this  
> > is too soon.
> >
> > You can force the selection of a different CPC with the  
> > btl_openib_cpc_include MCA param:
> >
> >  mpirun --mca btl_openib_cpc_include oob ...
> >  mpirun --mca btl_openib_cpc_include xoob ...
> >  mpirun --mca btl_openib_cpc_include rdma_cm ...
> >  mpirun --mca btl_openib_cpc_include ibcm ...
> >
> > You might want to concentrate on testing oob and xoob to ensure that  
> > we didn't cause any regressions.  The ibcm and rdma_cm CPCs probably  
> > still have some rough edges (and the IBCM package in OFED itself may  
> > not be 100% -- that's one of the things we're evaluating.  It's known  
> > to not install properly on RHEL4U4, for example -- you have to  
> > manually mknod and chmod a device in /dev/infiniband for every HCA in  
> > the host).
> >
> > Thanks.
> >
> >   
> 
> 
> -- 
> Pavel Shamis (Pasha)
> Mellanox Technologies
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


Re: [OMPI devel] Affect of compression on modex and launch messages

2008-04-07 Thread Gleb Natapov
On Mon, Apr 07, 2008 at 07:54:38AM -0600, Ralph H Castain wrote:
> 
> 
> 
> On 4/7/08 7:45 AM, "Gleb Natapov"  wrote:
> 
> > On Mon, Apr 07, 2008 at 07:28:07AM -0600, Ralph H Castain wrote:
> >>> Also can you explain how
> >>> allgather is implemented in orte (sorry if you already explained this once
> >>> and I missed it).
> >> 
> >> The default method is for each proc to send its modex data to its local
> >> daemon. The local daemon collects the messages until all of its local procs
> >> have contributed, then sends the collected data to the rank=0 application
> >> proc. One rank=0 has received a message from every daemon, it xcasts the
> >> collected result to all procs in its job.
> >> 
> > Only collected result is compressed or messages from each proc to local
> > daemon and messages from local daemon to rank=0 are compressed too?
> 
> The individual inbound messages are not currently compressed prior to
> sending - too small to bother
Makes sense.

> > Also I think if rank=0 will compress each modex message during
> > receive it can save some work during xcast.
> 
> Seems to me like one compress of the entire message has to be a great deal
> faster than N compressions of N small messages...
The idea is that the modex receive and the compression will overlap.

--
Gleb.


Re: [OMPI devel] Affect of compression on modex and launch messages

2008-04-07 Thread Gleb Natapov
On Mon, Apr 07, 2008 at 07:28:07AM -0600, Ralph H Castain wrote:
> > Also can you explain how
> > allgather is implemented in orte (sorry if you already explained this once
> > and I missed it).
> 
> The default method is for each proc to send its modex data to its local
> daemon. The local daemon collects the messages until all of its local procs
> have contributed, then sends the collected data to the rank=0 application
> proc. One rank=0 has received a message from every daemon, it xcasts the
> collected result to all procs in its job.
>
Is only the collected result compressed, or are the messages from each proc
to the local daemon and from the local daemon to rank=0 compressed too?
And, maybe a stupid question, but I have to ask :) When rank=0 xcasts the
collected modex, does it compress it once or for each rank separately?
Also, I think that if rank=0 compresses each modex message during receive
it can save some work during the xcast.

--
Gleb.


Re: [OMPI devel] Affect of compression on modex and launch messages

2008-04-07 Thread Gleb Natapov
On Mon, Apr 07, 2008 at 07:07:38AM -0600, Ralph H Castain wrote:
> 
> 
> 
> On 4/7/08 7:04 AM, "Gleb Natapov"  wrote:
> 
> > On Fri, Apr 04, 2008 at 10:52:38AM -0600, Ralph H Castain wrote:
> >> With compression "on", you will get output telling you the original size of
> >> the message and its compressed size so you can see what was done.
> >> 
> > I see this output:
> > uncompressed allgather msg orig size 67521 compressed size 4162.
> > 
> > What is "allgather msg"
> 
> It is the modex message - it is "shared" across all the procs via an
> allgather procedure
> 
If I divide the allgather msg size by the number of processes, I should get
the modex size of one process. Is this correct? Also, can you explain how
the allgather is implemented in orte (sorry if you already explained this
once and I missed it)?

--
Gleb.


Re: [OMPI devel] Affect of compression on modex and launch messages

2008-04-07 Thread Gleb Natapov
On Fri, Apr 04, 2008 at 10:52:38AM -0600, Ralph H Castain wrote:
> With compression "on", you will get output telling you the original size of
> the message and its compressed size so you can see what was done.
> 
I see this output:
uncompressed allgather msg orig size 67521 compressed size 4162.

What is the "allgather msg"?

--
Gleb.


Re: [OMPI devel] RFC: changes to modex

2008-04-03 Thread Gleb Natapov
On Thu, Apr 03, 2008 at 07:05:28AM -0600, Ralph H Castain wrote:
> H...since I have no control nor involvement in what gets sent, perhaps I
> can be a disinterested third party. ;-)
> 
> Could you perhaps explain this comment:
> 
> > BTW I looked at how we do modex now on the trunk. For OOB case more
> > than half the data we send for each proc is garbage.
> 
> 
> What "garbage" are you referring to? I am working to remove the stuff
> inserted by proc.c - mostly hostname, hopefully arch, etc. If you are
> running a "debug" version, there will also be type descriptors for each
> entry, but those are eliminated for optimized builds.
> 
> So are you referring to other things?
I am talking about the openib part of the modex. The "garbage" I am
referring to is this:

This is the structure that is sent by the modex for each openib BTL. We send
the entire structure by copying it into the message.
struct mca_btl_openib_port_info {
    uint32_t mtu;
#if OMPI_ENABLE_HETEROGENEOUS_SUPPORT
    uint8_t padding[4];
#endif
    uint64_t subnet_id;
    uint16_t lid;     /* used only in xrc */
    uint16_t apm_lid; /* the lid used for APM to a different port */
    char *cpclist;
};

The sizeof() of the struct is 32 bytes, but how much useful info does it
actually contain?
mtu       - should really be uint8 since this is an encoded value (1,2,3,4)
padding   - is garbage.
subnet_id - is ok
lid       - should be sent only in the XRC case
apm_lid   - should be sent only if APM is enabled
cpclist   - is pure garbage and should not be in this struct at all.

So we send 32 bytes with only 9 bytes of useful info (for the non-XRC case
without APM enabled).
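
A hedged sketch of what packing only the useful bytes could look like for
the common (non-XRC, no-APM) case; the buffer layout and helper are
illustrative, not an existing Open MPI API, and endianness handling is
omitted:

#include <stdint.h>
#include <string.h>

static size_t pack_port_info(uint8_t *buf, uint8_t mtu_code,
                             uint64_t subnet_id)
{
    size_t off = 0;
    buf[off++] = mtu_code;                 /* encoded MTU value 1..4 */
    memcpy(buf + off, &subnet_id, sizeof(subnet_id));
    off += sizeof(subnet_id);
    /* lid / apm_lid would be appended only when XRC / APM is in use */
    return off;                            /* 9 bytes in the common case */
}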

--
Gleb.


Re: [OMPI devel] RFC: changes to modex

2008-04-03 Thread Gleb Natapov
On Wed, Apr 02, 2008 at 08:41:14PM -0400, Jeff Squyres wrote:
> >> that it's the same for all procs on all hosts.  I guess there's a few
> >> cases:
> >>
> >> 1. homogeneous include/exclude, no carto: send all in node info; no
> >> proc info
> >> 2. homogeneous include/exclude, carto is used: send all ports in node
> >> info; send index in proc info for which node info port index it  
> >> will use
> > This may actually increase modex size. Think about two procs using two
> > different hcas. We'll send all the data we send today + indexes.
> 
> It'll increase it compared to the optimization that we're about to  
> make.  But it will certainly be a large decrease compared to what  
> we're doing today

Maybe I don't understand something in what you propose then. Currently,
when I run two procs on the same node and each proc uses a different HCA,
each one of them sends a message that describes the HCA in use by that
proc. The message is of the form <mtu, subnet id, lid, apm_lid, cpc list>.
Each proc sends one of those, so there are two messages total on the wire.
You propose that one of them should send a description of both available
ports (that is, one of them sends two messages of the form above) and then
each proc sends an additional message with the index of the HCA that it is
going to use. That is more data on the wire after the proposed optimization
than we have now.


>   (see the spreadsheet that I sent last week).
I've looked at it but I could not decipher it :( I don't understand
where all these numbers come from.

> 
> Indeed, we can even put in the optimization that if there's only one  
> process on a host, it can only publish the ports that it will use (and  
> therefore there's no need for the proc data).
More special cases :(

> 
> >> 3. heterogeneous include/exclude, no cart: need user to tell us that
> >> this situation exists (e.g., use another MCA param), but then is same
> >> as #2
> >> 4. heterogeneous include/exclude, cart is used, same as #3
> >>
> >> Right?
> >>
> > Looks like it. FWIW I don't like the idea to code all those special
> > cases. The way it works now I can be pretty sure that any crazy setup
> > I'll come up with will work.
> 
> And so it will with the new scheme.  The only place it won't work is  
> if the user specifies a heterogeneous include/exclude (i.e., we'll  
> require that the user tells us when they do that), which nobody does.
> 
> I guess I don't see the problem...?
I like things to be simple. The KISS principle, I guess. And I do care
about heterogeneous include/exclude too.

BTW, I looked at how we do the modex now on the trunk. For the OOB case,
more than half of the data we send for each proc is garbage.

> 
> > By the way how much data are moved during modex stage? What if modex
> > will use compression?
> 
> 
> The spreadsheet I listed was just the openib part of the modex, and it  
> was fairly hefty.  I have no idea how well (or not) it would compress.
> 
I looked at what kind of data we send during the openib modex and I created
a file with 1 openib modex messages. The mtu, subnet id and cpc list were
the same in each message but the lid/apm_lid were different; this is a
pretty close approximation of the data that is sent from the HN to each
process. The uncompressed file size is 489K, the compressed file size is
43K. More than 10 times smaller.
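
For the record, that kind of measurement can be reproduced with zlib's
one-shot API; a hedged sketch of compressing a modex blob (plain zlib
usage, not code that exists in Open MPI):

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

static int compress_modex_blob(const unsigned char *msg, unsigned long len)
{
    uLongf comp_len = compressBound(len);   /* worst-case output size */
    Bytef *comp = malloc(comp_len);
    if (NULL == comp) return -1;

    if (Z_OK != compress2(comp, &comp_len, msg, len, Z_BEST_COMPRESSION)) {
        free(comp);
        return -1;
    }
    printf("orig size %lu compressed size %lu\n", len,
           (unsigned long) comp_len);
    free(comp);
    return 0;
}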

--
Gleb.


Re: [OMPI devel] RFC: changes to modex

2008-04-02 Thread Gleb Natapov
On Wed, Apr 02, 2008 at 03:45:20PM -0400, Jeff Squyres wrote:
> On Apr 2, 2008, at 1:58 PM, Gleb Natapov wrote:
> >> No, I think it would be fine to only send the output after
> >> btl_openib_if_in|exclude is applied.  Perhaps we need an MCA param to
> >> say "always send everything" in the case that someone applies a non-
> >> homogeneous if_in|exclude set of values...?
> >>
> >> When is carto stuff applied?  Is that what you're really asking  
> >> about?
> >>
> > There is no difference between carto and include/exclude.
> 
> You mean in terms of when they are applied?
I mean that there are multiple ways to use a different hca/port in
different procs on the same host.

> 
> > I can specify
> > different openib_if_include values for different procs on the same  
> > host.
> 
> 
> I know you *can*, but it is certainly uncommon.  The common case is  
Uncommon - yes, but do you want to make it unsupported?

> that it's the same for all procs on all hosts.  I guess there's a few  
> cases:
> 
> 1. homogeneous include/exclude, no carto: send all in node info; no  
> proc info
> 2. homogeneous include/exclude, carto is used: send all ports in node  
> info; send index in proc info for which node info port index it will use
This may actually increase the modex size. Think about two procs using two
different hcas: we'll send all the data we send today plus the indexes.

> 3. heterogeneous include/exclude, no cart: need user to tell us that  
> this situation exists (e.g., use another MCA param), but then is same  
> as #2
> 4. heterogeneous include/exclude, cart is used, same as #3
> 
> Right?
> 
Looks like it. FWIW I don't like the idea of coding all those special
cases. The way it works now, I can be pretty sure that any crazy setup
I come up with will work.

By the way, how much data is moved during the modex stage? What if the
modex used compression?

--
Gleb.


Re: [OMPI devel] RFC: changes to modex

2008-04-02 Thread Gleb Natapov
On Wed, Apr 02, 2008 at 12:08:47PM -0400, Jeff Squyres wrote:
> On Apr 2, 2008, at 11:13 AM, Gleb Natapov wrote:
> > On Wed, Apr 02, 2008 at 10:35:03AM -0400, Jeff Squyres wrote:
> >> If we use carto to limit hcas/ports are used on a given host on a  
> >> per-
> >> proc basis, then we can include some proc_send data to say "this proc
> >> only uses indexes X,Y,Z from the node data".  The indexes can be
> >> either uint8_ts, or maybe even a variable length bitmap.
> >>
> > So you propose that each proc will send info (using node_send())  
> > about every
> > hca/proc on a host even about those that are excluded from use by  
> > the proc
> > just in case? And then each proc will have to send additional info  
> > (using
> > proc_send() this time) to indicate what hcas/ports it is actually  
> > using?
> 
> 
> No, I think it would be fine to only send the output after  
> btl_openib_if_in|exclude is applied.  Perhaps we need an MCA param to  
> say "always send everything" in the case that someone applies a non- 
> homogeneous if_in|exclude set of values...?
> 
> When is carto stuff applied?  Is that what you're really asking about?
> 
There is no difference between carto and include/exclude. I can specify
different openib_if_include values for different procs on the same host.

--
Gleb.


Re: [OMPI devel] RFC: changes to modex

2008-04-02 Thread Gleb Natapov
On Wed, Apr 02, 2008 at 10:35:03AM -0400, Jeff Squyres wrote:
> If we use carto to limit hcas/ports are used on a given host on a per- 
> proc basis, then we can include some proc_send data to say "this proc  
> only uses indexes X,Y,Z from the node data".  The indexes can be  
> either uint8_ts, or maybe even a variable length bitmap.
> 
So you propose that each proc will send info (using node_send()) about every
hca/port on a host, even about those that are excluded from use by the proc,
just in case? And then each proc will have to send additional info (using
proc_send() this time) to indicate which hcas/ports it is actually using?

--
Gleb.


Re: [OMPI devel] RFC: changes to modex

2008-04-02 Thread Gleb Natapov
On Wed, Apr 02, 2008 at 10:21:12AM -0400, Jeff Squyres wrote:
>   * int ompi_modex_proc_send(...): send modex data that is specific to  
> this process.  It is just about exactly the same as the current API  
> call (ompi_modex_send).
> 
[skip]
> 
>   * int ompi_modex_node_send(...): send modex data that is relevant  
> for all processes in this job on this node.  It is intended that only  
> one process in a job on a node will call this function.  If more than  
> one process in a job on a node calls _node_send(), then only one will  
> "win" (meaning that the data sent by the others will be overwritten).
> 
In the case of the openib BTL, what part of the modex are you going to send
using proc_send() and what part using node_send()?

--
Gleb.


Re: [OMPI devel] Switching away from SVN?

2008-03-24 Thread Gleb Natapov
On Fri, Mar 21, 2008 at 08:52:03AM -0400, Jeff Squyres wrote:
> Cool -- thanks Roland!
> 
> For anyone who wants to play with the entire history of OMPI in git  
> (as of last night or so -- this git repository is *not* being kept in  
> sync with SVN), I cloned the tree that Roland created and put it here:
> 
>  http://www.open-mpi.org/~jsquyres/unofficial/ompi.git
> 
> So you can:
> 
>  git clone http://www.open-mpi.org/~jsquyres/unofficial/ompi.git
> 
> And then work with local git operations from there.
It is very useful! Is it possible to sync it with SVN nightly?
The only problem I have with it is that git doesn't track empty
directories and autogen.sh fails without ompi/mca/io/romio/romio/confdb.

> 
> 
> 
> On Mar 20, 2008, at 8:53 PM, Roland Dreier wrote:
> >  has some interesting
> > info about svn->git conversions (and svn vs. next-gen distibuted
> > systems in general).
> >
> > Also, out of curiousity I tried doing
> >
> >git-svn clone --stdlayout http://svn.open-mpi.org/svn/ompi/
> >
> > and it seemed to work fine (git-svn is part of the main git
> > distribution).  The only obvious thing missing is that you would
> > probably want to set up an author file for a real conversion, so that
> > you get real names instead of just "jsquyres".  It took a while to
> > run, mostly because it has to grab each svn changeset one by one.
> >
> > The interesting thing is that a checkout of the current ompi tree
> > seems to be about 37 MB, while .git directory of my repository, which
> > has the entire history of all branches of the svn repository plus
> > 1.6MB of svn metadata is 36 MB.  And git can do fun stuff like
> >
> >git diff v1.1..v1.2
> >
> > in half a second (it generates a 274858 line diff).  It can generate
> > the full 116320 line (11164 commit) log of the trunk in .3 seconds.
> >
> > Jeff, if you want to see the repository, it is in
> >
> >/data/home/roland/ompi.git
> >
> > Feel free to make it available however you want (it's your data of  
> > course).
> >
> > - R.
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> Cisco Systems
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


Re: [OMPI devel] OMPI OpenIB Credit Schema breaks Chelsio HW

2008-03-11 Thread Gleb Natapov
On Mon, Mar 10, 2008 at 01:52:22PM -0500, Steve Wise wrote:
> 
>Does OMPI do lazy dereg to maintain a cache of registered user buffers?

Not by default. You'll have to use -mca mpi_leave_pinned 1 to enable
lazy dereg.

--
Gleb.


Re: [OMPI devel] OMPI OpenIB Credit Schema breaks Chelsio HW

2008-03-10 Thread Gleb Natapov
On Mon, Mar 10, 2008 at 09:50:13AM -0500, Steve Wise wrote:
> > I personally don't like the idea to add another layer of complexity to 
> > openib
> > BTL code just to work around HW that doesn't follow spec. If work around
> > is simple that is OK, but in this case it is not so simple and will add
> > code path that is rarely tested. A simple workaround for the problem may
> > be to not configure multiple QPs if HW has a bug (and we can extend ini
> > file to contain this info).
> >
> >   
> 
> It doesn't sound too complex to implement the above design.  In fact, 
> that's the way this btl used to work, no?There are large customers 
> requesting OMPI over cxgb3 and we're ready to provide the effort to get 
> this done.  So I request we come to an agreement on how to support this 
> device efficiently (and for ompi-1.3).
Yes, the btl used to work like that before. But the current way of doing
credit management requires much less memory, so I don't think reverting
to the old way is the right thing. And having two different ways of
sending credit updates seems like additional complexity. The problem is
not just with writing the code: this code will have to be maintained for an
unknown period of time (will this problem be solved in your next-gen HW?).
I am OK with adding the old FC in addition to the current FC if the code is
trivial and easy to maintain. Do you think it is possible to add what you
want and meet these requirements?

--
Gleb.


Re: [OMPI devel] OMPI OpenIB Credit Schema breaks Chelsio HW

2008-03-09 Thread Gleb Natapov
On Sun, Mar 09, 2008 at 02:48:09PM -0500, Jon Mason wrote:
> Issue (as described by Steve Wise):
> 
> Currently OMPI uses qp 0 for all credit updates (by design).  This breaks
> when running over the chelsio rnic due to a race condition between
> advertising the availability of a buffer using qp0 when the buffer was
> posted on one of the other qps.  It is possible (and easily reproducible)
> that the peer gets the advertisement and sends data into the qp in question
> _before_ the rnic has processed the recv buffer and made it available for
> placement.  This results in a connection termination.  BTW, other hca's
> have this issue too.  ehca, for example, claims they have the same race
> condition.  I think the timing hole is much smaller though for devices that
> have 2 separate work queues for the SQ and RQ of a QP.  Chelsio has a
> single work queue to implement both SQ and RQ, so processing of RQ work
> requests gets queued up behind pending SQ entries which can make this race
> condition more prevalent.
There was a discussion about this on the OpenFabrics mailing list and the
conclusion was that what Open MPI does is correct according to the IB/iWARP
specs.

> 
> I don't know of any way to avoid this issue other that to ensure that all
> credit updates for qp X are posted only on qp X.  If we do this, then the
> chelsio HW/FW ensures that the RECV is posted before the subsequent send
> operation that advertises the buffer is processed.
Is it possible to fix your FW to follow the iWARP spec? Perhaps it is
possible to implement ibv_post_recv() so that it does not return before the
posted receive is processed?

> To address this Jeff Squyres recommends:
> 
> 1. make an mca parameter that governs this behavior (i.e., whether to send
> all flow control messages on QP0 or on their respective QPs)
> 
> 2. extend the ini file parsing code to accept this parameter as well (need
> to add a strcmp or two)
> 
> 3. extend the ini file to fill in this value for all the nic's listed (to
> include yours).
> 
> 4. extend the logic in the rest of the btl to send the flow control
> messages either across qp0 or the respective qp, depending on the value of
> the mca param / ini value.
> 
> 
> I am happy to do the work to enable this, but I would like to get
> everyone's feed back before I start down this path.  Jeff said Gleb did
> the work to change openib to behave this way, so any insight would be
> helpful.
> 
I personally don't like the idea of adding another layer of complexity to the
openib BTL code just to work around HW that doesn't follow the spec. If the
workaround is simple, that is OK, but in this case it is not so simple and will
add a code path that is rarely tested. A simple workaround for the problem may
be to not configure multiple QPs if the HW has a bug (and we can extend the ini
file to contain this info).

--
Gleb.


Re: [OMPI devel] orte can't launch process

2008-03-06 Thread Gleb Natapov
On Thu, Mar 06, 2008 at 07:49:13AM -0500, Tim Prins wrote:
> Sorry about that. I removed a field in a structure, then 'svn up' seems 
> to have added it back, so we were using a field that should not even 
> exist in a couple places.
> 
> Should be fixed in r17757
Works again. Thanks

--
Gleb.


[OMPI devel] orte can't launch process

2008-03-06 Thread Gleb Natapov
Something is broken in the trunk.

# mpirun -np 2 -H host1,host2  ./osu_latency
--
Some of the requested hosts are not included in the current allocation.

The requested hosts were specified with --host as:
host1,host2

Please check your allocation or your request.
--
--
mpirun was unable to start the specified application as it encountered
an error.
More information may be available above.
--

If I create a hostfile with host1 and host2 and use it instead of -H,
mpirun works.

--
Gleb.


Re: [OMPI devel] RDMA pipeline

2008-02-21 Thread Gleb Natapov
On Wed, Feb 20, 2008 at 04:08:46PM -0500, George Bosilca wrote:
> So I tracked this issue and it seems that the new behavior was  
> introduced one year ago by the commit 12433. Starting from this commit, 
Except that the log message of this commit says:

   Fix regression from v1.1.
   1) make the code do what comment says
   2) if memory is prepinned don't send multiple PUT messages.

And to be absolutely sure, I checked v1.1 and of course there is no
pipeline for TCP BTLs there as well.

> there was no pipeline in the RDMA protocol. That make sense as we usually 
> don't use NetPipe all the time to check the performances of the message 
> logging (we use real applications). However, last week, we did a NetPipe 
> and that's how we realized the lack of pipelining for the RDMA case.
Perhaps at the time you wrote message logging you relied on buggy
behaviour that was later fixed.

>
> I would be in favor of having a consistent behavior everywhere. In other 
> words don't ask the user to know if there is or not an mpool associated 
> with a particular device, in order to figure out what protocol we use 
> internally. Actually, it's not only for users, it might help us as well.
>
The user indeed shouldn't care what protocol we use as long as performance
is good. The pipeline is needed to improve the performance of some "insane"
interconnects that need memory pinning. The heuristic of OB1 is very simple:
if the send and receive message buffers are pinned, do not use the pipeline
(no matter what interconnect is in use); otherwise use the pipeline protocol
to hide the pinning cost. The only assumption OB1 makes is that if a BTL has
no MPOOL then all memory is always pinned. Think about the pipeline as the
slow path and no pipeline as the fast path.
For Infiniband we use every dirty trick in the book (registration cache +
ptmalloc) to go down the fast path, and you want TCP/MX/ELAN to always go
down the slow path! This doesn't make sense to me.
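
The heuristic boils down to something like the following (a simplified
paraphrase; the real decision is spread across mca_pml_ob1_rdma_btls() and
mca_pml_ob1_send_request_start_btl(), and the names here are stand-ins):

#include <stdbool.h>
#include <stddef.h>

/* rdma_btl_count is what mca_pml_ob1_rdma_btls() returned: non-zero means
 * the buffer is considered already pinned, which includes every buffer on
 * a BTL that has no mpool. */
static bool use_single_rdma(bool buffer_contiguous, size_t rdma_btl_count)
{
    return buffer_contiguous && rdma_btl_count > 0;   /* fast path */
    /* otherwise: pipeline protocol to hide the pinning cost (slow path) */
}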

If you need the pipeline in OB1 to hide the message logging cost, we may add
another config parameter that always enables the pipeline. We may even not
expose it to users, but set it automatically if message logging is enabled.

>   Thanks,
> george.
>
> On Feb 20, 2008, at 4:29 AM, Gleb Natapov wrote:
>
>> On Tue, Feb 19, 2008 at 10:40:46PM -0500, George Bosilca wrote:
>>> Actually, it restores the original behavior. The RDMA operations were
>>> pipelined before the r15247 commit, independent of the fact that they
>>> had mpool or not. We were actively using this behavior in the message
>>> logging framework to hide the cost of the local storage of the  
>>> payload,
>>> and we were quite surprised when we realized that it disappeared.
>> I checked v1.2 with tcp BTL (I can't test mx or elan, but tcp also
>> support RDMA and has no mpool) and no matter what  
>> btl_tcp_max_rdma_size
>> I provide the whole buffer is sent in one rdma operation. And here is
>> explanation why this happens:
>> 1. If BTL is RDMA capable but does not provide mpool
>> mca_pml_ob1_rdma_btls() assumes that memory is always registered. This
>> function will always return non zero value for any buffer it is called
>> with in our case.
>>
>> 2. When mca_pml_ob1_send_request_start_btl() chooses what function to
>> use for rendezvous send it checks if buffer is contiguous and if it is
>> then it check if buffer is already registered by checking non zero  
>> value
>> returned by mca_pml_ob1_rdma_btls() and for BTLs without mpool
>> mca_pml_ob1_send_request_start_rdma() is always chosen.
>>
>> 3. Receiver checks if local buffer is registered by calling
>> mca_pml_ob1_rdma_btls() on it (see pml_ob1_recvreq.c:259):
>>
>>  recvreq->req_rdma_cnt = mca_pml_ob1_rdma_btls(
>>  bml_endpoint,
>>  (unsigned char*) base,
>>  recvreq->req_recv.req_bytes_packed,
>>  recvreq->req_rdma);
>> So recvreq->req_rdma_cnt is set to non zero value (if receive buffer  
>> is
>> contiguous of cause).
>>
>> 4. Receiver send PUT messages to a senders in
>> mca_pml_ob1_recv_request_schedule_exclusive(). Here is the code snip
>> from the function (see pml_ob1_recvreq.c:684):
>>
>>   /* makes sure that we don't exceed BTL max rdma size
>>* if memory is not pinned already */
>>   if(0 == recvreq->req_rdma_cnt &&
>> bml_btl->btl_max_rdma_size != 0 &&
>> size > bml_btl->btl_max_rdma_size)
>>   {
>>
>>   size = bml_btl->btl_max_rdma_size;
>>   }
>> Pay special attention to a comment. If recvreq->req_rdma_cnt is not
>> zero btl_max_rdma_size is i

Re: [OMPI devel] RDMA pipeline

2008-02-20 Thread Gleb Natapov
On Tue, Feb 19, 2008 at 10:40:46PM -0500, George Bosilca wrote:
> Actually, it restores the original behavior. The RDMA operations were  
> pipelined before the r15247 commit, independent of the fact that they  
> had mpool or not. We were actively using this behavior in the message  
> logging framework to hide the cost of the local storage of the payload, 
> and we were quite surprised when we realized that it disappeared.
I checked v1.2 with the tcp BTL (I can't test mx or elan, but tcp also
supports RDMA and has no mpool) and no matter what btl_tcp_max_rdma_size
I provide, the whole buffer is sent in one RDMA operation. Here is the
explanation of why this happens:
 1. If a BTL is RDMA capable but does not provide an mpool,
 mca_pml_ob1_rdma_btls() assumes that memory is always registered. In our
 case this function will always return a non-zero value for any buffer it
 is called with.

 2. When mca_pml_ob1_send_request_start_btl() chooses which function to
 use for a rendezvous send, it checks whether the buffer is contiguous, and
 if it is, it checks whether the buffer is already registered by looking at
 the non-zero value returned by mca_pml_ob1_rdma_btls(). So for BTLs without
 an mpool, mca_pml_ob1_send_request_start_rdma() is always chosen.

 3. The receiver checks whether the local buffer is registered by calling
 mca_pml_ob1_rdma_btls() on it (see pml_ob1_recvreq.c:259):

  recvreq->req_rdma_cnt = mca_pml_ob1_rdma_btls(
  bml_endpoint,
  (unsigned char*) base,
  recvreq->req_recv.req_bytes_packed,
  recvreq->req_rdma);
 So recvreq->req_rdma_cnt is set to a non-zero value (if the receive buffer
 is contiguous, of course).

 4. The receiver sends PUT messages to the senders in
 mca_pml_ob1_recv_request_schedule_exclusive(). Here is a code snippet
 from the function (see pml_ob1_recvreq.c:684):

   /* makes sure that we don't exceed BTL max rdma size
* if memory is not pinned already */
   if(0 == recvreq->req_rdma_cnt &&
 bml_btl->btl_max_rdma_size != 0 &&
 size > bml_btl->btl_max_rdma_size)
   {

   size = bml_btl->btl_max_rdma_size;
   }
 Pay special attention to the comment. If recvreq->req_rdma_cnt is not
 zero, btl_max_rdma_size is ignored and the message is sent with one big RDMA
 operation.
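
Boiling the four steps above down to a standalone sketch (hypothetical names,
not the actual ob1 code, and assuming a contiguous buffer): for a BTL that
advertises RDMA but has no mpool the buffer always counts as registered, so
the clamp to btl_max_rdma_size is never applied and the transfer is not
pipelined.

#include <stdbool.h>
#include <stddef.h>

/* sketch only: how large an RDMA fragment gets scheduled */
static size_t rdma_fragment_size(bool btl_has_mpool, bool found_in_reg_cache,
                                 size_t message_size, size_t btl_max_rdma_size)
{
    /* step 1: no mpool => memory is assumed to be registered */
    bool counts_as_registered = !btl_has_mpool || found_in_reg_cache;

    if (counts_as_registered)        /* steps 2-4: one big PUT/GET            */
        return message_size;
    if (btl_max_rdma_size != 0 && message_size > btl_max_rdma_size)
        return btl_max_rdma_size;    /* pipelined in max_rdma_size chunks     */
    return message_size;
}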

So what I have shown here is that there was no pipeline for the TCP btl in
v1.2 and that the code was specifically written to behave this way.
If you still think that there is a difference in behaviour between v1.2
and the trunk, can you explain what code path is executed in v1.2 for
your test case and how the trunk behaves differently?

>
> If a BTL don't want to use pipeline for RDMA operations, it can set the 
> RDMA fragment size to the max value, and this will automatically disable 
> the pipeline. However, if the BTL support pipeline with the trunk version 
> today it is not possible to activate it. Moreover, in the current version 
> the parameters that define the BTL behavior are blatantly ignored, as the 
> PML make high level assumption about what they want to do.
I am not defending the current behaviour. If you want to change it we can
discuss the exact semantics that you want to see. But before that I want to
make sure that the trunk is indeed different from v1.2 in this regard, as
you claim it to be. Can you provide me with a test case that works
differently in v1.2 and the trunk?

--
Gleb.


Re: [OMPI devel] RDMA pipeline

2008-02-19 Thread Gleb Natapov
On Tue, Feb 19, 2008 at 02:13:30PM -0500, George Bosilca wrote:
> Few days ago during some testing I realize that the RDMA pipeline was  
> disabled for MX and Elan (I didn't check for the others). A quick look  
> into the source code, pinpointed the problem into the pml_ob1_rdma.c  
> file, and it seems that the problem was introduced by commit 15247. The 
> problem comes from the usage of the dummy registration, which is set for 
> all non mpool friendly BTL. Later on this is checked against NULL (and of 
> course it fails), which basically disable the RDMA pipeline.
Do you mean that mca_pml_ob1_send_request_start_rdma() is used for
rendezvous sends? I will be very surprised if ompi 1.2 works
differently. It assumes that if a BTL has no mpool then the entire message
buffer is registered and no pipeline is needed. The trunk does the same, just
differently. OpenIB also chooses this route if the buffer memory was allocated
by MPI_Alloc_mem().
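
For reference, this is the kind of allocation the last sentence refers to; a
buffer obtained this way can be registered up front by the library, so per the
above the openib BTL takes the same no-pipeline route for it (a minimal
two-rank example, sizes and tags arbitrary):

#include <mpi.h>

int main(int argc, char **argv)
{
    void *buf;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* memory the MPI library may pre-register on allocation */
    MPI_Alloc_mem(1 << 20, MPI_INFO_NULL, &buf);

    if (rank == 0)
        MPI_Send(buf, 1 << 20, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, 1 << 20, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    MPI_Free_mem(buf);
    MPI_Finalize();
    return 0;
}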

>
> I'll enable the RDMA pipeline back in 2 days if I don't hear anything  
> back. Attached is the patch that fix this problem.
>
I am not sure why you need a pipeline for BTLs that don't require
registration, but by applying this patch you'll change how ompi has behaved
since v1.0 (unless I am missing something, in which case please provide more
explanation).

--
Gleb.


Re: [OMPI devel] btl_openib_rnr_retry MCA param

2008-02-13 Thread Gleb Natapov
On Wed, Feb 13, 2008 at 09:05:24AM -0500, Jeff Squyres wrote:
> Actually, we should then also print out a different error message when  
> RNR occurs in PP QP's, too.  It should be something along the lines of  
> "flow control problem occurred; this shouldn't happen..." (right now  
> it says RNR happened, and goes into detail into what that means -- but  
> that's not the real problem).
> 
Good point.

> I'll do that as well.
Thanks!

> 
> 
> On Feb 13, 2008, at 12:59 AM, Gleb Natapov wrote:
> 
> > On Tue, Feb 12, 2008 at 05:41:13PM -0500, Jeff Squyres wrote:
> >> I see that in the OOB CPC for the openib BTL, when setting up the  
> >> send
> >> side of the QP, we set the rnr_retry value depending on whether the
> >> remote receive queue is a per-peer or SRQ:
> >>
> >> - SRQ: btl_openib_rnr_retry MCA param value
> >> - PP: 0
> >>
> >> The rationale given in a comment is that setting the RNR to 0 is a
> >> good way to find bugs in our flow control.
> >>
> >> Do we really want this in production builds?  Or do we want 0 for
> >> developer builds and the same btl_openib_rnr_retry value for PP  
> >> queues?
> >>
> > The comment is mine and IMO it should stay that way for production
> > builds. SW flow control either work or it doesn't and if it doesn't I
> > prefer to know about it immediately. Setting PP to some value greater
> > then 0 just delays the manifestation of the problem and in the case of
> > iWarp such possibility doesn't even exists.
> >
> > --
> > Gleb.
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> Cisco Systems
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


Re: [OMPI devel] [RFC] Remove explicit call to progress() from ob1.

2008-02-13 Thread Gleb Natapov
On Tue, Feb 12, 2008 at 05:57:22PM -0500, Jeff Squyres wrote:
> Were these supposed to cover the time required for pinning and  
> unpinning?
That is what the comment says, but the CPU executes code, not comments :)
Memory pinning happens inside prepare_dst(); after prepare_dst() returns
the memory is already pinned. If you want to call progress after each
call to prepare_dst() you can still do it by setting recv_pipeline_depth
to 1. And unpinning happens in an entirely different place, after the RDMA
completion is acknowledged.

> 
> Can you explain why you think they're unnecessary?
> 
The much better question is "Why are they necessary?", because if there
is no good answer to this question then they should be removed, since
they are harmful: they cause uncontrollable recursive calls.

> 
> On Feb 12, 2008, at 5:27 AM, Gleb Natapov wrote:
> 
> > Hi,
> >
> > I am planning to commit the following patch. Those two progress()  
> > calls
> > are responsible for most of our deep recursion troubles. And I also
> > think they are completely unnecessary.
> >
> > diff --git a/ompi/mca/pml/ob1/pml_ob1_recvreq.c b/ompi/mca/pml/ob1/ 
> > pml_ob1_recvreq.c
> > index 5899243..641176e 100644
> > --- a/ompi/mca/pml/ob1/pml_ob1_recvreq.c
> > +++ b/ompi/mca/pml/ob1/pml_ob1_recvreq.c
> > @@ -704,9 +704,6 @@ int mca_pml_ob1_recv_request_schedule_once(
> > mca_bml_base_free(bml_btl,dst);
> > continue;
> > }
> > -
> > -/* run progress as the prepare (pinning) can take some time  
> > */
> > -mca_bml.bml_progress();
> > }
> >
> > return OMPI_SUCCESS;
> > diff --git a/ompi/mca/pml/ob1/pml_ob1_sendreq.c b/ompi/mca/pml/ob1/ 
> > pml_ob1_sendreq.c
> > index 0998a05..9d7f3f9 100644
> > --- a/ompi/mca/pml/ob1/pml_ob1_sendreq.c
> > +++ b/ompi/mca/pml/ob1/pml_ob1_sendreq.c
> > @@ -968,7 +968,6 @@ cannot_pack:
> > mca_bml_base_free(bml_btl,des);
> > continue;
> > }
> > -mca_bml.bml_progress();
> > }
> >
> > return OMPI_SUCCESS;
> > --
> > Gleb.
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> Cisco Systems
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


Re: [OMPI devel] btl_openib_rnr_retry MCA param

2008-02-13 Thread Gleb Natapov
On Tue, Feb 12, 2008 at 05:41:13PM -0500, Jeff Squyres wrote:
> I see that in the OOB CPC for the openib BTL, when setting up the send  
> side of the QP, we set the rnr_retry value depending on whether the  
> remote receive queue is a per-peer or SRQ:
> 
> - SRQ: btl_openib_rnr_retry MCA param value
> - PP: 0
> 
> The rationale given in a comment is that setting the RNR to 0 is a  
> good way to find bugs in our flow control.
> 
> Do we really want this in production builds?  Or do we want 0 for  
> developer builds and the same btl_openib_rnr_retry value for PP queues?
> 
The comment is mine and IMO it should stay that way for production
builds. SW flow control either works or it doesn't, and if it doesn't I
prefer to know about it immediately. Setting PP to some value greater
than 0 just delays the manifestation of the problem, and in the case of
iWARP such a possibility doesn't even exist.
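
For context, rnr_retry is the value programmed during the standard RTR->RTS
transition; here is a minimal libibverbs sketch (not the openib BTL's actual
code, the timeout/retry numbers are arbitrary, and qp is assumed to be an RC
QP already in RTR):

#include <infiniband/verbs.h>
#include <string.h>

static int move_to_rts(struct ibv_qp *qp, uint8_t rnr_retry, uint32_t sq_psn)
{
    struct ibv_qp_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.qp_state      = IBV_QPS_RTS;
    attr.timeout       = 14;
    attr.retry_cnt     = 7;
    attr.rnr_retry     = rnr_retry;  /* 0 for PP queues per the rationale above;
                                      * 7 would mean "retry forever" */
    attr.sq_psn        = sq_psn;
    attr.max_rd_atomic = 1;

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}

With rnr_retry set to 0 the sender gets an error completion on the first RNR
NAK, which is exactly the early warning argued for above.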

--
Gleb.


Re: [OMPI devel] Something wrong with vt?

2008-02-12 Thread Gleb Natapov
On Tue, Feb 12, 2008 at 01:08:32PM +0100, Matthias Jurenz wrote:
> Hi Gleb,
> 
> that's very strange... cause' the corresponding 'Makefile.in' is
> definitely not empty (checked in to the SVN repository).
Ah, here is the problem. Makefile.in is empty in my tree. I am building
not from an SVN checkout, but from another source tree that is synced
with the SVN checkout, and the sync process considers Makefile.in files as
generated and ignores them. Why is Makefile.in not regenerated by
autogen.sh in the vt sources?


> Could you reproduce this error after 'make distclean, configure, make' ?
> Which version of the autotools are you using?
> 
> 
> Matthias
> 
> On Mo, 2008-02-11 at 11:42 +0200, Gleb Natapov wrote:
> 
> > I get the following error while "make install":
> > 
> > make[2]: Entering directory `/home_local/glebn/build_dbg/ompi/contrib/vt'
> > Making install in vt
> > make[3]: Entering directory `/home_local/glebn/build_dbg/ompi/contrib/vt/vt'
> > make[3]: *** No rule to make target `install'.  Stop.
> > make[3]: Leaving directory `/home_local/glebn/build_dbg/ompi/contrib/vt/vt'
> > make[2]: *** [install-recursive] Error 1
> > make[2]: Leaving directory `/home_local/glebn/build_dbg/ompi/contrib/vt'
> > make[1]: *** [install-recursive] Error 1
> > make[1]: Leaving directory `/home_local/glebn/build_dbg/ompi'
> > make: *** [install-recursive] Error 1
> > 
> > ompi/contrib/vt/vt/Makefile is empty!
> > --
> > Gleb.
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > 
> 
> --
> Matthias Jurenz,
> Center for Information Services and 
> High Performance Computing (ZIH), TU Dresden, 
> Willersbau A106, Zellescher Weg 12, 01062 Dresden
> phone +49-351-463-31945, fax +49-351-463-37773



> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


[OMPI devel] [RFC] Remove explicit call to progress() from ob1.

2008-02-12 Thread Gleb Natapov
Hi,

I am planning to commit the following patch. Those two progress() calls
are responsible for most of our deep recursion troubles. And I also
think they are completely unnecessary.

diff --git a/ompi/mca/pml/ob1/pml_ob1_recvreq.c 
b/ompi/mca/pml/ob1/pml_ob1_recvreq.c
index 5899243..641176e 100644
--- a/ompi/mca/pml/ob1/pml_ob1_recvreq.c
+++ b/ompi/mca/pml/ob1/pml_ob1_recvreq.c
@@ -704,9 +704,6 @@ int mca_pml_ob1_recv_request_schedule_once(
 mca_bml_base_free(bml_btl,dst);
 continue;
 }
-
-/* run progress as the prepare (pinning) can take some time */
-mca_bml.bml_progress();
 }

 return OMPI_SUCCESS;
diff --git a/ompi/mca/pml/ob1/pml_ob1_sendreq.c 
b/ompi/mca/pml/ob1/pml_ob1_sendreq.c
index 0998a05..9d7f3f9 100644
--- a/ompi/mca/pml/ob1/pml_ob1_sendreq.c
+++ b/ompi/mca/pml/ob1/pml_ob1_sendreq.c
@@ -968,7 +968,6 @@ cannot_pack:
 mca_bml_base_free(bml_btl,des);
 continue;
 }
-mca_bml.bml_progress();
 }

 return OMPI_SUCCESS;
--
Gleb.


[OMPI devel] Something wrong with vt?

2008-02-11 Thread Gleb Natapov
I get the following error while "make install":

make[2]: Entering directory `/home_local/glebn/build_dbg/ompi/contrib/vt'
Making install in vt
make[3]: Entering directory `/home_local/glebn/build_dbg/ompi/contrib/vt/vt'
make[3]: *** No rule to make target `install'.  Stop.
make[3]: Leaving directory `/home_local/glebn/build_dbg/ompi/contrib/vt/vt'
make[2]: *** [install-recursive] Error 1
make[2]: Leaving directory `/home_local/glebn/build_dbg/ompi/contrib/vt'
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory `/home_local/glebn/build_dbg/ompi'
make: *** [install-recursive] Error 1

ompi/contrib/vt/vt/Makefile is empty!
--
Gleb.


Re: [OMPI devel] 32 bit udapl warnings

2008-01-31 Thread Gleb Natapov
On Thu, Jan 31, 2008 at 08:45:54AM -0500, Don Kerr wrote:
> This was brought to my attention once before but I don't see this 
> message so I just plain forgot about it. :-(
> uDAPL defines its pointers as uint64, "typedef DAT_UINT64 DAT_VADDR", 
> and pval is a "void *" which is why the message comes up.  If I remove 
> the cast I believe I get a different warning and I just haven't stopped 
> to think of a way around this.
dat_pointer = (DAT_VADDR)(uintptr_t)void_pointer;

This is not just a warning; this is a real bug. If the MSB of a void pointer
is 1, it will be sign extended.
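
A small standalone illustration of the two casts (DAT_VADDR is just a stand-in
typedef here). Per GCC's documented pointer-to-integer conversion, in a 32-bit
build the direct cast sign-extends a pointer whose top bit is set, while going
through uintptr_t zero-extends; in a 64-bit build both agree:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t DAT_VADDR;   /* stand-in for the uDAPL typedef */

int main(void)
{
    /* pretend the allocator returned an address above 2GB in a 32-bit process */
    void *p = (void *)(uintptr_t)0x80001000u;

    DAT_VADDR wrong = (DAT_VADDR)p;            /* warns; may sign-extend      */
    DAT_VADDR right = (DAT_VADDR)(uintptr_t)p; /* zero-extends, no warning    */

    printf("wrong = 0x%llx\nright = 0x%llx\n",
           (unsigned long long)wrong, (unsigned long long)right);
    return 0;
}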

> 
> Tim Prins wrote:
> > Hi,
> >
> > I am seeing some warnings on the trunk when compiling udapl in 32 bit 
> > mode with OFED 1.2.5.1:
> >
> > btl_udapl.c: In function 'udapl_reg_mr':
> > btl_udapl.c:95: warning: cast from pointer to integer of different size
> > btl_udapl.c: In function 'mca_btl_udapl_alloc':
> > btl_udapl.c:852: warning: cast from pointer to integer of different size
> > btl_udapl.c: In function 'mca_btl_udapl_prepare_src':
> > btl_udapl.c:959: warning: cast from pointer to integer of different size
> > btl_udapl.c:1008: warning: cast from pointer to integer of different size
> > btl_udapl_component.c: In function 'mca_btl_udapl_component_progress':
> > btl_udapl_component.c:871: warning: cast from pointer to integer of 
> > different size
> > btl_udapl_endpoint.c: In function 'mca_btl_udapl_endpoint_write_eager':
> > btl_udapl_endpoint.c:130: warning: cast from pointer to integer of 
> > different size
> > btl_udapl_endpoint.c: In function 'mca_btl_udapl_endpoint_finish_max':
> > btl_udapl_endpoint.c:775: warning: cast from pointer to integer of 
> > different size
> > btl_udapl_endpoint.c: In function 'mca_btl_udapl_endpoint_post_recv':
> > btl_udapl_endpoint.c:864: warning: cast from pointer to integer of 
> > different size
> > btl_udapl_endpoint.c: In function 
> > 'mca_btl_udapl_endpoint_initialize_control_message':
> > btl_udapl_endpoint.c:1012: warning: cast from pointer to integer of 
> > different size
> >
> >
> > Thanks,
> >
> > Tim
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >   
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


Re: [OMPI devel] open ib btl and xrc

2008-01-20 Thread Gleb Natapov
On Fri, Jan 18, 2008 at 11:43:03AM -0500, Jeff Squyres wrote:
> I think the main savings is that mellanox hardware works better when  
> fewer qp's are open.  I.e., it's a resource issue on the HCA, not  
> necessarily a savings in posting buffers to the qp.
Interesting. I hear this justification of XRC for the first time. It
was always about decreasing memory consumption. As far as I know from
the tests we ran here, the QP cache on a Mellanox HCA is small (10-12 QPs), so I
doubt XRC will help here, but maybe there is another threshold after
which performance drops even more.

> 
> But it's quite a complicated issue.  :-)
> 
> Gleb has some reservations about XRC; I'll let him expound on them...
My current "reservations" are not about XRC per se, but about how OFED
became to be a platform for Mellanox to push things to the world without
any serious reviews. I don't really care about 1 things that goes
into OFED without going into the linux kernel first as long as they are
not change/define interfaces. Upcoming OFED 1.3 will include XRC interface
without any feedback from linux kernel developers. What if interface will
have to be changed in order to be included into the linux kernel? Do you
remember why PLPA exists? Because some distribution hurried to include
process affinity before interface was finalized. Same thing are happening
here. But this discussion is not for this list :)

> 
> 
> 
> On Jan 18, 2008, at 1:06 AM, Don Kerr wrote:
> 
> > Those pointers were perfect thanks.
> >
> > It easy to see the benefit of fewer qps (per node instead of per peer)
> > and less consumption of resources the better but I am curious about  
> > the
> > actual percentage of memory footprint decrease. I am thinking that the
> > largest portion of the footprint comes from the fragments. Do you have
> > any numbers showing the actual memory footprint savings when using  
> > xrc?
> > Just to be clear, I am not asking for you or anyone else to generate
> > these numbers, but if you had them already I would be curious to know
> > the over all savings.
> >
> > -DON
> >
> > Pavel Shamis (Pasha) wrote:
> >> Here is paper from openib http://www.openib.org/archives/nov2007sc/XRC.pdf
> >> and here is mvapich presentation
> >> http://mvapich.cse.ohio-state.edu/publications/ofa_nov07-mvapich-xrc.pdf
> >>
> >> Button line: XRC decrease number of QPs that ompi opens and as result
> >> decrease ompi's memory footprint.
> >> In the openib paper you may see more details about XRC. If you need  
> >> more
> >> details about XRC implemention
> >> in openib blt , please let me know.
> >>
> >>
> >> Instead
> >> Don Kerr wrote:
> >>
> >>> Hi,
> >>>
> >>> After searching, about the only thing I can find on xrc is what it
> >>> stands for, can someone explain the benefits of open mpi's use of  
> >>> xrc,
> >>> maybe point me to a paper, or both?
> >>>
> >>> TIA
> >>> -DON
> >>>
> >>> ___
> >>> devel mailing list
> >>> de...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>
> >>>
> >>>
> >>
> >>
> >>
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> Cisco Systems
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


Re: [OMPI devel] [PATCH] openib btl: extensable cpcselection enablement

2008-01-14 Thread Gleb Natapov
On Mon, Jan 14, 2008 at 08:15:23AM -0500, Jeff Squyres (jsquyres) wrote:
> Any obj to bringing this stuff to the trunk?  The moden string opt stuff can 
> be done directly on the trunk imo.
Go ahead.

--
Gleb.


[OMPI devel] ptmalloc and pin down cache problems again

2008-01-07 Thread Gleb Natapov
Hi Brian,

 I encountered a problem with ptmalloc and the registration cache. I see that
you (I think it was you) disabled shrinking of heap memory allocated
by sbrk by setting MORECORE_CANNOT_TRIM to 1. The comment explains that
this was done because freeing of small objects is not reentrant, so if the
ompi memory subsystem callback calls free() the code will deadlock.
The trick indeed works in single-threaded programs, but in multithreaded
programs ptmalloc may allocate a heap not only via sbrk, but via mmap too. This
is called an "arena". Each thread may have arenas of its own. The problem is
that ptmalloc may free an arena by calling munmap(), and then a free() that
is called from our callback deadlocks. I tried to compile with USE_ARENAS set
to 0, but the code doesn't compile. I can fix the compilation problem of
course, but it seems that it is not such a good idea to disable this feature.
The ptmalloc scalability depends on it (and even if we disable it
ptmalloc may still create an arena via mmap if sbrk fails). I see only one
way to solve this problem: do not call free() inside mpool callbacks.
If freeing of memory is needed (and it is needed, since IB unregister
calls free()) the work should be deferred. For the IB mpool we can check what
needs to be unregistered inside the callback, but actually call unregister()
from the next mpool->register() call. Do you see any problems with this
approach?
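
A minimal sketch of the deferral idea, with made-up names (not the actual
mpool code): the memory-hook callback only marks registrations as dead, and
the real deregistration -- which may end up calling free() -- is done on the
next register() call, outside the allocator:

#include <stddef.h>

#define MAX_DEFERRED 128

struct reg { void *base; size_t len; int dead; };

static struct reg *deferred[MAX_DEFERRED];
static int n_deferred;

/* called from the ptmalloc hook: must not call free()/unregister here */
static void mem_release_cb(struct reg *r)
{
    r->dead = 1;
    if (n_deferred < MAX_DEFERRED)
        deferred[n_deferred++] = r;
}

/* called at the start of the next mpool->register(): safe to unregister
 * (and therefore free) now */
static void drain_deferred(void (*do_unregister)(struct reg *))
{
    int i;
    for (i = 0; i < n_deferred; i++)
        do_unregister(deferred[i]);
    n_deferred = 0;
}

The obvious cost is that dead registrations pin memory a little longer; the
benefit is that the hook never re-enters the allocator.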

--
Gleb.


Re: [OMPI devel] Common initialization code for IB.

2008-01-07 Thread Gleb Natapov
On Thu, Jan 03, 2008 at 09:27:14AM -0500, Jeff Squyres wrote:
> > Another
> > problem is how multicast collective knows that all processes in a
> > communicator are reachable via the same network, do we have a  
> > mechanism
> > in ompi to check this?
> 
> 
> Good question.
> 
> Perhaps the common_of stuff could hang some data off the ompi_proc_t  
> that can be read by any of-like component (btl openib, coll of  
> multicast, etc.)...?  This could contain a subnet ID, or perhaps a  
> reachable flag, or somesuch.
> 
But we calculate reachability inside the BTL at the modex stage, so if an HCA
is not used by the BTL there is no reachability info for it.

--
Gleb.


[OMPI devel] Common initialization code for IB.

2008-01-03 Thread Gleb Natapov
Hi,

  In Paris we talked about putting HCA discovery and initialization code
outside of the openib BTL so that other components that want to use IB will be
able to share common code, data and the registration cache. The other
components I am thinking about are ofud and multicast collectives. I started to
look at this and I have a couple of problems with this approach. Currently the
openib BTL has if_include/if_exclude parameters to control which HCAs should be
used. Should we make those parameters global and initialize only the HCAs
that are not excluded by those filters, or should we initialize all HCAs
and let each component have its own include/exclude filters? Another
problem is how the multicast collective knows that all processes in a
communicator are reachable via the same network; do we have a mechanism
in ompi to check this?

--
Gleb.


Re: [OMPI devel] [ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process

2007-12-25 Thread Gleb Natapov
On Mon, Dec 24, 2007 at 11:49:37PM +, Tang, Changqing wrote:
> 
> 
> > -Original Message-
> > From: Pavel Shamis (Pasha) [mailto:pa...@dev.mellanox.co.il]
> > Sent: Monday, December 24, 2007 8:03 AM
> > To: Tang, Changqing
> > Cc: Jack Morgenstein; Roland Dreier;
> > gene...@lists.openfabrics.org; Open MPI Developers;
> > mvapich-disc...@cse.ohio-state.edu
> > Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP
> > independent of any one user process
> >
> > Hi CQ,
> > Tang, Changqing wrote:
> > > If I have a MPI server processes on a node, many other MPI
> > > client processes will dynamically connect/disconnect with
> > the server. The server use same XRC domain.
> > >
> > > Will this cause accumulating the "kernel" QP for such
> > > application ? we want the server to run 365 days a year.
> > >
> > I have some question about the scenario above. Did you call
> > for the mpi disconnect on the both ends (server/client)
> > before the client exit (did we must to do it?)
> 
> Yes, both ends will call disconnect. But for us, MPI_Comm_disconnect() call
> is not a collective call, it is just a local operation.
But the spec says that MPI_Comm_disconnect() is a collective call:
http://www.mpi-forum.org/docs/mpi-20-html/node114.htm#Node114

> 
> --CQ
> 
> 
> >
> > Regards,
> > Pasha.
> > >
> > > Thanks.
> > > --CQ
> > >
> > >
> > >
> > >
> > >
> > >> -Original Message-
> > >> From: Pavel Shamis (Pasha) [mailto:pa...@dev.mellanox.co.il]
> > >> Sent: Thursday, December 20, 2007 9:15 AM
> > >> To: Jack Morgenstein
> > >> Cc: Tang, Changqing; Roland Dreier;
> > >> gene...@lists.openfabrics.org; Open MPI Developers;
> > >> mvapich-disc...@cse.ohio-state.edu
> > >> Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP
> > >> independent of any one user process
> > >>
> > >> Adding Open MPI and MVAPICH community to the thread.
> > >>
> > >> Pasha (Pavel Shamis)
> > >>
> > >> Jack Morgenstein wrote:
> > >>
> > >>> background:  see "XRC Cleanup order issue thread" at
> > >>>
> > >>>
> > >>>
> > >>>
> > >>
> > http://lists.openfabrics.org/pipermail/general/2007-December/043935.h
> > >> t
> > >>
> > >>> ml
> > >>>
> > >>> (userspace process which created the receiving XRC qp on a
> > >>>
> > >> given host
> > >>
> > >>> dies before other processes which still need to receive XRC
> > >>>
> > >> messages
> > >>
> > >>> on their SRQs which are "paired" with the now-destroyed
> > >>>
> > >> receiving XRC
> > >>
> > >>> QP.)
> > >>>
> > >>> Solution: Add a userspace verb (as part of the XRC suite) which
> > >>> enables the user process to create an XRC QP owned by the
> > >>>
> > >> kernel -- which belongs to the required XRC domain.
> > >>
> > >>> This QP will be destroyed when the XRC domain is closed
> > >>>
> > >> (i.e., as part
> > >>
> > >>> of a ibv_close_xrc_domain call, but only when the domain's
> > >>>
> > >> reference count goes to zero).
> > >>
> > >>> Below, I give the new userspace API for this function.  Any
> > >>>
> > >> feedback will be appreciated.
> > >>
> > >>> This API will be implemented in the upcoming OFED 1.3
> > >>>
> > >> release, so we need feedback ASAP.
> > >>
> > >>> Notes:
> > >>> 1. There is no query or destroy verb for this QP. There is
> > >>>
> > >> also no userspace object for the
> > >>
> > >>>QP. Userspace has ONLY the raw qp number to use when
> > >>>
> > >> creating the (X)RC connection.
> > >>
> > >>> 2. Since the QP is "owned" by kernel space, async events
> > >>>
> > >> for this QP are also handled in kernel
> > >>
> > >>>space (i.e., reported in /var/log/messages). There are
> > >>>
> > >> no completion events for the QP, since
> > >>
> > >>>it does not send, and all receives completions are
> > >>>
> > >> reported in the XRC SRQ's cq.
> > >>
> > >>>If this QP enters the error state, the remote QP which
> > >>>
> > >> sends will start receiving RETRY_EXCEEDED
> > >>
> > >>>errors, so the application will be aware of the failure.
> > >>>
> > >>> - Jack
> > >>>
> > >>>
> > >>
> > =
> > >> =
> > >>
> > >>> 
> > >>> /**
> > >>>  * ibv_alloc_xrc_rcv_qp - creates an XRC QP for serving as
> > >>>
> > >> a receive-side only QP,
> > >>
> > >>>  *and moves the created qp through the RESET->INIT and
> > >>>
> > >> INIT->RTR transitions.
> > >>
> > >>>  *  (The RTR->RTS transition is not needed, since this
> > >>>
> > >> QP does no sending).
> > >>
> > >>>  *The sending XRC QP uses this QP as destination, while
> > >>>
> > >> specifying an XRC SRQ
> > >>
> > >>>  *for actually receiving the transmissions and
> > >>>
> > >> generating all completions on the
> > >>
> > >>>  *receiving side.
> > >>>  *
> > >>>  *This QP is created in kernel space, and persists
> > >>>
> > >> until the XRC domain is closed.
> > >>
> > >>>  *(i.e., its reference count goes to zero).
> > >>>  *
> > >>>  * @pd: protection domain to use.  A

Re: [OMPI devel] openib xrc CPC minor nit

2007-12-21 Thread Gleb Natapov
On Thu, Dec 20, 2007 at 05:39:36PM -0500, Jeff Squyres wrote:
> Pasha --
> 
> I notice in the port info struct that you have a member for the lid,  
> but only #if HAVE_XRC.  Per a comment in the code, this is supposed to  
> save bytes when we're using OOB (because we don't need this value in  
> the OOB CPC).
> 
> I think we should remove this #if and always have this struct member.   
> ~4 extra bytes (because it's DSS packed) is no big deal.  It's packed  
> in with all the other modex info, so the message is already large.  4  
> more bytes per port won't make a difference (IMHO).
> 
> And keep in mind that #if HAVE_XRC is true if XRC is supported -- we  
> still send the extra bytes if XRC is supported and not used (which is  
> the default when compiling for OFED 1.3, no?).
> 
> So I think we should remove those #if's and just always have that data  
> member there.  It's up to the CPC's if they want to use that info or  
> not.
> 
> Any objections to me removing this #if on the openib-cpc branch?  (and  
> eventual merge back up to the trunk)
> 
Remove it, and add a capability mask to the port info structure. The capability
mask will contain the types of CPCs supported by a port. I may need this before
openib-cpc is merged back to the trunk.
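
A rough sketch of what such a modex entry could look like (all names here are
made up, not the actual openib port-info struct):

#include <stdint.h>

/* hypothetical CPC capability bits advertised per port */
#define CPC_CAP_OOB    0x0001   /* RC setup over the OOB/RML channel  */
#define CPC_CAP_XOOB   0x0002   /* XRC setup over the OOB/RML channel */
#define CPC_CAP_RDMACM 0x0004   /* RDMA CM based setup                */

struct port_info_sketch {
    uint64_t subnet_id;
    uint16_t lid;        /* always sent, per the suggestion above */
    uint16_t cpc_caps;   /* bitmask of CPC_CAP_* values           */
};

Each side would intersect its local mask with the peer's and pick the highest
priority CPC from the intersection, which also gives a clean place to detect
"XRC requested but not supported by the peer".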

--
Gleb.


Re: [OMPI devel] matching code rewrite in OB1

2007-12-18 Thread Gleb Natapov
On Mon, Dec 17, 2007 at 08:08:02PM -0500, Richard Graham wrote:
>   Needless to say (for the nth time :-) ) that changing this bit of code
> makes me
>  nervous.
I've noticed it already :)

>However, it occurred to me that there is a much better way to
> test
>  this code than setting up an environment that generates some out of order
>  events with out us being able to specify the order.
>   Since this routine is executed serially, it should be sufficient to set up
> a test
>  code that would simulate any out-of-order scenario we want.  If one
> specifies
>  number of ³messages² to be ³sent², and ³randomly² changes the order they
>  arrive (e.g. scramble some input vector), one can check and see if the
> messages
>  are ³received² in the correct order.  One could even ³drop² messages and
> see
>  if matching stops.  Using a test code like this, and a code coverage tool,
> one
>  should be able to get much better testing that we have to date.
While I sometimes do unit testing for code that I write, in this case it is
easy to generate all reasonable corner cases without isolating the code in a
separate unit. I ran this code through several specially constructed MPI
applications and checked code coverage with gcov. Here is the result:

File '/home/glebn/OpenMPI/ompi.stg/ompi/mca/pml/ob1/pml_ob1_recvfrag.c'
Lines executed:97.58% of 124

Only two lines of the code were never executed, both for error cases that
should cause an abort anyway.
The pml_ob1_recvfrag.c.gcov file with the coverage results is attached for
whoever is curious enough to look at it. BTW I doubt that the previous
code passed this level of testing. At least with gcov it is not possible
to generate meaningful results when most of the code is inside macros.


>   What would you think about doing something like this ?   Seems like a few
> hours
>  of this sort of simulation would be much better than even years of testing
> and
>  relying on random fluctuations in the run to thoroughly test out-of-order
> scenarios.
> 
> What do you think ?
I think that the coverage testing I did is enough for this code.

> Rich
> 
> 
> On 12/17/07 8:32 AM, "Gleb Natapov"  wrote:
> 
> > On Thu, Dec 13, 2007 at 08:04:21PM -0500, Richard Graham wrote:
> >> > Yes, should be a bit more clear.  Need an independent way to verify that
> >> > data is matched
> >> >  in the correct order ­ sending this information as payload is one way to
> >> do
> >> > this.  So,
> >> >  sending unique data in every message, and making sure that it arrives in
> >> > the user buffers
> >> >  in the expected order is a way to do this.
> > 
> > Did that. Encoded sequence number in a payload and sent many eager
> > packets from one rank to another. Many packets were reoredered, but
> > application received everything in a correct order.
> > 
> > --
> > Gleb.
> > 
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > 
> 
> 

> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.
-:0:Source:/home/glebn/OpenMPI/ompi.stg/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
-:0:Graph:pml_ob1_recvfrag.gcno
-:0:Data:pml_ob1_recvfrag.gcda
-:0:Runs:8
-:0:Programs:1
-:0:Source is newer than graph
-:1:/*
-:2: * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
-:3: * University Research and Technology
-:4: * Corporation.  All rights reserved.
-:5: * Copyright (c) 2004-2006 The University of Tennessee and The University
-:6: * of Tennessee Research Foundation.  All rights
-:7: * reserved.
-:8: * Copyright (c) 2004-2007 High Performance Computing Center Stuttgart,
-:9: * University of Stuttgart.  All rights reserved.
-:   10: * Copyright (c) 2004-2005 The Regents of the University of California.
-:   11: * All rights reserved.
-:   12: * $COPYRIGHT$
-:   13: * 
-:   14: * Additional copyrights may follow
-:   15: * 
-:   16: * $HEADER$
-:   17: */
-:   18:
-:   19:/**
-:   20: * @file
-:   21: */
-:   22:
-:   23:#include "ompi_config.h"
-:   24:
-:   25:#include "opal/class/

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r16969

2007-12-17 Thread Gleb Natapov
On Mon, Dec 17, 2007 at 10:53:26AM -0500, Jeff Squyres wrote:
> Gleb -
>
> Is this picture of the v1.3 long message params accurate?  (see attached)
Yes.

--
Gleb.


Re: [OMPI devel] matching code rewrite in OB1

2007-12-17 Thread Gleb Natapov
On Thu, Dec 13, 2007 at 08:04:21PM -0500, Richard Graham wrote:
> Yes, should be a bit more clear.  Need an independent way to verify that
> data is matched
>  in the correct order ­ sending this information as payload is one way to do
> this.  So,
>  sending unique data in every message, and making sure that it arrives in
> the user buffers
>  in the expected order is a way to do this.

Did that. I encoded a sequence number in the payload and sent many eager
packets from one rank to another. Many packets were reordered, but the
application received everything in the correct order.
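
Roughly the kind of self-checking test described here (message count and sizes
are arbitrary): every message carries its own sequence number in the payload
and the receiver verifies MPI delivered them in order, no matter how the BTL
reordered fragments underneath. Run with exactly two ranks:

#include <mpi.h>
#include <stdio.h>

#define NMSG 10000

int main(int argc, char **argv)
{
    int rank, payload, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (i = 0; i < NMSG; i++) {
            payload = i;    /* sequence number travels in the payload */
            MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        }
    } else if (rank == 1) {
        for (i = 0; i < NMSG; i++) {
            MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (payload != i) {
                fprintf(stderr, "order violated: got %d, expected %d\n",
                        payload, i);
                MPI_Abort(MPI_COMM_WORLD, 1);
            }
        }
        printf("all %d messages arrived in order\n", NMSG);
    }

    MPI_Finalize();
    return 0;
}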

--
Gleb.



Re: [OMPI devel] rb rcache component

2007-12-15 Thread Gleb Natapov
On Sat, Dec 15, 2007 at 08:27:29AM -0500, Jeff Squyres wrote:
> It doesn't look like this component is used anymore  
> (it's .ompi_ignore'd).
> 
> Anyone object to svn rm'ing it on the trunk?
> 
Not me.

--
Gleb.


Re: [OMPI devel] New BTL parameter

2007-12-14 Thread Gleb Natapov
If there is no objection I will commit this to the trunk next week.

On Sun, Dec 09, 2007 at 05:34:30PM +0200, Gleb Natapov wrote:
> Hi,
> 
>   Currently BTL has parameter btl_min_send_size that is no longer used.
> I want to change it to be btl_rndv_eager_limit. This new parameter will
> determine a size of a first fragment of rendezvous protocol. Now we use
> btl_eager_limit to set its size. btl_rndv_eager_limit will have to be
> smaller or equal to btl_eager_limit. By default it will be equal to
> btl_eager_limit so no behavior change will be observed if default is
> used.
> 

--
Gleb.


Re: [OMPI devel] matching code rewrite in OB1

2007-12-14 Thread Gleb Natapov
On Fri, Dec 14, 2007 at 06:53:55AM -0500, Richard Graham wrote:
> If you have positive confirmation that such things have happened, this will
> go a long way.
I instrumented the code to log all kinds of info about fragment reordering while
I chased a bug in openib that caused the matching logic to malfunction. Any
non-trivial application that uses the OpenIB BTL will have reordered fragments.
(I wish this were not the case, but I don't have a solution yet.)

> I will not trust the code until this has also been done with
> multiple independent network paths. 
I ran IMB over IP and IB simultaneously on more than 80 ranks.

>  I very rarely express such strong
> opinions, even if I don't agree with what is being done, but this is the
> core of correct MPI functionality, and first hand experience has shown that
I agree that this is indeed a very important piece of code, but it certainly
is not more important than the datatype engine, for instance (and it is much
easier to test all corner cases in the matching logic than in the datatype
engine IMHO). And even if the matching code works perfectly, if other parts of
OB1 are buggy Open MPI will not work properly, so why is this code
chosen to be a sacred cow?

> just thinking through the logic, I can miss some of the race conditions.
That is of course correct, but the more people look at the code the
better, isn't it?

> The code here has been running for 8+ years in two production MPI's running
> on very large clusters, so I am very reluctant to make changes for what
Are you sure about this? I see a number of changes to this code during
Open MPI development, and the current SVN does not hold all the history of
this code, unfortunately. Here is the list of commits that I found; some
of them change the code logic quite a bit:
r6770,r7342,r8339,r8352,r8353,r8356,r8946,r11874,r12323,r12582

> seems to amount to people's taste - maintenance is not an issue in this
> case.  Had this not been such a key bit of code, I would not even bat an
Why do you think that maintenance is not an issue? It is for me. Otherwise
I wouldn't even look at this part of the code. All those macros prevent the use
of a debugger, for instance.

(And I see a small latency improvement too :))

> eye.  I suppose if you can go through some formal verification, this would
> also be good - actually better than hoping that one will hit out-of-order
> situations.
> 
> Rich
> 
> 
> On 12/14/07 2:20 AM, "Gleb Natapov"  wrote:
> 
> > On Thu, Dec 13, 2007 at 06:16:49PM -0500, Richard Graham wrote:
> >> The situation that needs to be triggered, just as George has mentions, is
> >> where we have a lot of unexpected messages, to make sure that when one that
> >> we can match against comes in, all the unexpected messages that can be
> >> matched with pre-posted receives are matched.  Since we attempt to match
> >> only when a new fragment comes in, we need to make sure that we don't leave
> >> other unexpected messages that can be matched in the unexpected queue, as
> >> these (if the out of order scenario is just right) would block any new
> >> matches from occurring.
> >> 
> >> For example:  Say the next expect message is 25
> >> 
> >> Unexpected message queue has:  26 28 29 ..
> >> 
> >> If 25 comes in, and is handled, if 26 is not pulled off the unexpected
> >> message queue, when 27 comes in it won't be able to be matched, as 26 is
> >> sitting in the unexpected queue, and will never be looked at again ...
> > This situation is triggered constantly with openib BTL. OpenIB BTL has
> > two ways to receive a packet: over a send queue or over an eager RDMA path.
> > Receiver polls both of them and may reorders packets locally. Actually
> > currently there is a bug in openib BTL that one channel may starve the other
> > at the receiver so if a match fragment with a next sequence number is in the
> > starved path tenth of thousands fragment can be reorederd. Test case 
> > attached
> > to ticket #1158 triggers this case and my patch handles all reordered 
> > packets.
> > 
> > And, by the way, the code is much simpler now and can be review easily ;)
> > 
> > --
> > Gleb.
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


Re: [OMPI devel] matching code rewrite in OB1

2007-12-14 Thread Gleb Natapov
On Thu, Dec 13, 2007 at 06:16:49PM -0500, Richard Graham wrote:
> The situation that needs to be triggered, just as George has mentions, is
> where we have a lot of unexpected messages, to make sure that when one that
> we can match against comes in, all the unexpected messages that can be
> matched with pre-posted receives are matched.  Since we attempt to match
> only when a new fragment comes in, we need to make sure that we don't leave
> other unexpected messages that can be matched in the unexpected queue, as
> these (if the out of order scenario is just right) would block any new
> matches from occurring.
> 
> For example:  Say the next expect message is 25
> 
> Unexpected message queue has:  26 28 29 ..
> 
> If 25 comes in, and is handled, if 26 is not pulled off the unexpected
> message queue, when 27 comes in it won't be able to be matched, as 26 is
> sitting in the unexpected queue, and will never be looked at again ...
This situation is triggered constantly with the openib BTL. The openib BTL has
two ways to receive a packet: over a send queue or over an eager RDMA path.
The receiver polls both of them and may reorder packets locally. Actually
there is currently a bug in the openib BTL where one channel may starve the
other at the receiver, so if the match fragment with the next sequence number
is in the starved path, tens of thousands of fragments can be reordered. The
test case attached to ticket #1158 triggers this case and my patch handles all
reordered packets.

And, by the way, the code is much simpler now and can be reviewed easily ;)
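
A self-contained toy showing the situation the matching code has to cope with
(hypothetical names, not the ob1 source): fragments carry a sequence number,
early arrivals from the "other" channel are parked, and they are drained once
the expected fragment finally shows up:

#include <stdbool.h>
#include <stdio.h>

#define NFRAGS 8

static bool parked[NFRAGS];    /* out-of-order arrivals waiting       */
static unsigned expected = 0;  /* next sequence number we may match   */

static void deliver(unsigned seq) { printf("matched fragment %u\n", seq); }

static void frag_arrived(unsigned seq)
{
    if (seq != expected) {     /* arrived ahead of time: park it      */
        parked[seq] = true;
        return;
    }
    deliver(seq);
    expected++;
    /* the new arrival may unblock previously parked fragments */
    while (expected < NFRAGS && parked[expected]) {
        parked[expected] = false;
        deliver(expected);
        expected++;
    }
}

int main(void)
{
    /* e.g. the eager RDMA path delivered 2,3 before the send queue path
     * delivered 0,1 */
    unsigned arrival_order[NFRAGS] = { 2, 3, 0, 1, 5, 4, 7, 6 };
    int i;
    for (i = 0; i < NFRAGS; i++)
        frag_arrived(arrival_order[i]);
    return 0;
}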

--
Gleb.


Re: [OMPI devel] [PATCH] openib: clean-up connect to allow for new cm's

2007-12-13 Thread Gleb Natapov
On Thu, Dec 13, 2007 at 10:49:45AM +0200, Pavel Shamis (Pasha) wrote:
>> Because we want to support mixed setups and create XRC between nodes that
>> support it and RC between all other nodes.
>>   
> Ok, sounds reasonable for me. Just need make sure that the parameters name 
> will be user friendly.
> Some thing like --mca enable-xrc that will cause to XOOB priority be 
> highest (and not something like --mca xoob 10 :-))
>
You can be sure that mca parameter name will be as cryptic as possible.

--
Gleb.


Re: [OMPI devel] matching code rewrite in OB1

2007-12-13 Thread Gleb Natapov
On Wed, Dec 12, 2007 at 03:10:10PM -0600, Brian W. Barrett wrote:
> On Wed, 12 Dec 2007, Gleb Natapov wrote:
> 
> > On Wed, Dec 12, 2007 at 03:46:10PM -0500, Richard Graham wrote:
> >> This is better than nothing, but really not very helpful for looking at the
> >> specific issues that can arise with this, unless these systems have several
> >> parallel networks, with tests that will generate a lot of parallel network
> >> traffic, and be able to self check for out-of-order received - i.e. this
> >> needs to be encoded into the payload for verification purposes.  There are
> >> some out-of-order scenarios that need to be generated and checked.  I think
> >> that George may have a system that will be good for this sort of testing.
> >>
> > I am running various test with multiple networks right now. I use
> > several IB BTLs and TCP BTL simultaneously. I see many reordered
> > messages and all tests were OK till now, but they don't encode
> > message sequence in a payload as far as I know. I'll change one of
> > them to do so.
> 
> Other than Rich's comment that we need sequence numbers, why add them?  We 
> haven't had them for non-matching packets for the last 3 years in Open MPI 
> (ie, forever), and I can't see why we would need them.  Yes, we need 
> sequence numbers for match headers to make sure MPI ordering is correct. 
> But for the rest of the payload, there's no need with OMPI's datatype 
> engine.  It's just more payload for no gain.
> 
As I understand it, what Rich proposes is that we need to construct a
special test that checks that the matching engine did its job right at
the application layer. In other words, the test should check that the payload
received is the correct one. He is not talking about adding additional
fields to the OB1 header.

--
Gleb.


Re: [OMPI devel] New BTL parameter

2007-12-13 Thread Gleb Natapov
On Wed, Dec 12, 2007 at 01:18:10PM -0800, Paul H. Hargrove wrote:
> Gleb Natapov wrote:
> > On Wed, Dec 12, 2007 at 02:03:02PM -0500, Jeff Squyres wrote:
> >   
> >> On Dec 9, 2007, at 10:34 AM, Gleb Natapov wrote:
> >>
> >> 
> >>>  Currently BTL has parameter btl_min_send_size that is no longer used.
> >>> I want to change it to be btl_rndv_eager_limit. This new parameter  
> >>> will
> >>> determine a size of a first fragment of rendezvous protocol. Now we  
> >>> use
> >>> btl_eager_limit to set its size. btl_rndv_eager_limit will have to be
> >>> smaller or equal to btl_eager_limit. By default it will be equal to
> >>> btl_eager_limit so no behavior change will be observed if default is
> >>> used.
> >>>   
> >> Can you describe why it would be better to have the value less than  
> >> the eager limit?
> >>
> >> 
> > It is just one more knob to tune OB1 algorithm. I sometimes don't want
> > to send any data by copy in/out at all. This is not possible right now.
> > With this new param I will be able to control this.
> >   
> 
>  From my experience tuning RDMA-rendezvous for the GASNet communications 
> library, I know that it was beneficial to piggyback some portion of the 
> payload on the rendezvous request.  However, the best [insert your 
> favorite performance metric here] was not always achieved by 
> piggybacking the maximum that could be buffered at the receiver 
> (equivalent of blt_eager_limit).  If I understand correctly, Gleb's 
> btl_rndv_eager_limit parameter would allow tuning for this behavior in OMPI.
Exactly. You explained it better than me.

> 
> An artificial/simplified example would be if the eager limit is 32K and 
> you have a 64K xfer.  Is it better to send 32K copy in/out plus 32K by 
> RDMA, or to send 8K copy in/out plus 56K by RDMA?  If the memcpy() 
> overhead for 32K of eager payload exceeds what can be overlapped with 
> the rendezvous setup then the second may be the better choice (higher 
> bandwidth, lower latency, and lower CPU overheads on both sender and 
> receiver).
> 

--
Gleb.


Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Gleb Natapov
On Wed, Dec 12, 2007 at 03:52:17PM -0500, Jeff Squyres wrote:
> On Dec 12, 2007, at 3:20 PM, Gleb Natapov wrote:
> 
> >> How about making a tarball with this patch in it that can be thrown  
> >> at
> >> everyone's MTT? (we can put the tarball on www.open-mpi.org  
> >> somewhere)
> > I don't have access to www.open-mpi.org, but I can send you the patch.
> > I can send you a tarball too, but I prefer to not abuse email.
> 
> Do you have access to staging.openfabrics.org?  I could download it  
> from there and put it on www.open-mpi.org.
> 
No. I don't :(

--
Gleb.


Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Gleb Natapov
On Wed, Dec 12, 2007 at 03:46:10PM -0500, Richard Graham wrote:
> This is better than nothing, but really not very helpful for looking at the
> specific issues that can arise with this, unless these systems have several
> parallel networks, with tests that will generate a lot of parallel network
> traffic, and be able to self check for out-of-order received - i.e. this
> needs to be encoded into the payload for verification purposes.  There are
> some out-of-order scenarios that need to be generated and checked.  I think
> that George may have a system that will be good for this sort of testing.
> 
I am running various tests with multiple networks right now. I use
several IB BTLs and the TCP BTL simultaneously. I see many reordered
messages and all tests have been OK so far, but they don't encode a
message sequence number in the payload as far as I know. I'll change one of
them to do so.

> Rich
> 
> 
> On 12/12/07 3:20 PM, "Gleb Natapov"  wrote:
> 
> > On Wed, Dec 12, 2007 at 11:57:11AM -0500, Jeff Squyres wrote:
> >> Gleb --
> >> 
> >> How about making a tarball with this patch in it that can be thrown at
> >> everyone's MTT? (we can put the tarball on www.open-mpi.org somewhere)
> > I don't have access to www.open-mpi.org, but I can send you the patch.
> > I can send you a tarball too, but I prefer to not abuse email.
> > 
> >> 
> >> 
> >> On Dec 11, 2007, at 4:14 PM, Richard Graham wrote:
> >> 
> >>> I will re-iterate my concern.  The code that is there now is mostly
> >>> nine
> >>> years old (with some mods made when it was brought over to Open
> >>> MPI).  It
> >>> took about 2 months of testing on systems with 5-13 way network
> >>> parallelism
> >>> to track down all KNOWN race conditions.  This code is at the center
> >>> of MPI
> >>> correctness, so I am VERY concerned about changing it w/o some very
> >>> strong
> >>> reasons.  Not apposed, just very cautious.
> >>> 
> >>> Rich
> >>> 
> >>> 
> >>> On 12/11/07 11:47 AM, "Gleb Natapov"  wrote:
> >>> 
> >>>> On Tue, Dec 11, 2007 at 08:36:42AM -0800, Andrew Friedley wrote:
> >>>>> Possibly, though I have results from a benchmark I've written
> >>>>> indicating
> >>>>> the reordering happens at the sender.  I believe I found it was
> >>>>> due to
> >>>>> the QP striping trick I use to get more bandwidth -- if you back
> >>>>> down to
> >>>>> one QP (there's a define in the code you can change), the reordering
> >>>>> rate drops.
> >>>> Ah, OK. My assumption was just from looking into code, so I may be
> >>>> wrong.
> >>>> 
> >>>>> 
> >>>>> Also I do not make any recursive calls to progress -- at least not
> >>>>> directly in the BTL; I can't speak for the upper layers.  The
> >>>>> reason I
> >>>>> do many completions at once is that it is a big help in turning
> >>>>> around
> >>>>> receive buffers, making it harder to run out of buffers and drop
> >>>>> frags.
> >>>>>  I want to say there was some performance benefit as well but I
> >>>>> can't
> >>>>> say for sure.
> >>>> Currently upper layers of Open MPI may call BTL progress function
> >>>> recursively. I hope this will change some day.
> >>>> 
> >>>>> 
> >>>>> Andrew
> >>>>> 
> >>>>> Gleb Natapov wrote:
> >>>>>> On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
> >>>>>>> Try UD, frags are reordered at a very high rate so should be a
> >>>>>>> good test.
> >>>>>> Good Idea I'll try this. BTW I thing the reason for such a high
> >>>>>> rate of
> >>>>>> reordering in UD is that it polls for MCA_BTL_UD_NUM_WC completions
> >>>>>> (500) and process them one by one and if progress function is
> >>>>>> called
> >>>>>> recursively next 500 completion will be reordered versus previous
> >>>>>> completions (reordering happens on a receiver, not sender).
> >>>>>> 
> >>>>>>> Andrew
> >>>>

Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Gleb Natapov
On Wed, Dec 12, 2007 at 11:57:11AM -0500, Jeff Squyres wrote:
> Gleb --
> 
> How about making a tarball with this patch in it that can be thrown at  
> everyone's MTT? (we can put the tarball on www.open-mpi.org somewhere)
I don't have access to www.open-mpi.org, but I can send you the patch.
I can send you a tarball too, but I prefer to not abuse email.

> 
> 
> On Dec 11, 2007, at 4:14 PM, Richard Graham wrote:
> 
> > I will re-iterate my concern.  The code that is there now is mostly  
> > nine
> > years old (with some mods made when it was brought over to Open  
> > MPI).  It
> > took about 2 months of testing on systems with 5-13 way network  
> > parallelism
> > to track down all KNOWN race conditions.  This code is at the center  
> > of MPI
> > correctness, so I am VERY concerned about changing it w/o some very  
> > strong
> > reasons.  Not apposed, just very cautious.
> >
> > Rich
> >
> >
> > On 12/11/07 11:47 AM, "Gleb Natapov"  wrote:
> >
> >> On Tue, Dec 11, 2007 at 08:36:42AM -0800, Andrew Friedley wrote:
> >>> Possibly, though I have results from a benchmark I've written  
> >>> indicating
> >>> the reordering happens at the sender.  I believe I found it was  
> >>> due to
> >>> the QP striping trick I use to get more bandwidth -- if you back  
> >>> down to
> >>> one QP (there's a define in the code you can change), the reordering
> >>> rate drops.
> >> Ah, OK. My assumption was just from looking into code, so I may be
> >> wrong.
> >>
> >>>
> >>> Also I do not make any recursive calls to progress -- at least not
> >>> directly in the BTL; I can't speak for the upper layers.  The  
> >>> reason I
> >>> do many completions at once is that it is a big help in turning  
> >>> around
> >>> receive buffers, making it harder to run out of buffers and drop  
> >>> frags.
> >>>  I want to say there was some performance benefit as well but I  
> >>> can't
> >>> say for sure.
> >> Currently upper layers of Open MPI may call BTL progress function
> >> recursively. I hope this will change some day.
> >>
> >>>
> >>> Andrew
> >>>
> >>> Gleb Natapov wrote:
> >>>> On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
> >>>>> Try UD, frags are reordered at a very high rate so should be a  
> >>>>> good test.
> >>>> Good Idea I'll try this. BTW I thing the reason for such a high  
> >>>> rate of
> >>>> reordering in UD is that it polls for MCA_BTL_UD_NUM_WC completions
> >>>> (500) and process them one by one and if progress function is  
> >>>> called
> >>>> recursively next 500 completion will be reordered versus previous
> >>>> completions (reordering happens on a receiver, not sender).
> >>>>
> >>>>> Andrew
> >>>>>
> >>>>> Richard Graham wrote:
> >>>>>> Gleb,
> >>>>>>  I would suggest that before this is checked in this be tested  
> >>>>>> on a
> >>>>>> system
> >>>>>> that has N-way network parallelism, where N is as large as you  
> >>>>>> can find.
> >>>>>> This is a key bit of code for MPI correctness, and out-of-order  
> >>>>>> operations
> >>>>>> will break it, so you want to maximize the chance for such  
> >>>>>> operations.
> >>>>>>
> >>>>>> Rich
> >>>>>>
> >>>>>>
> >>>>>> On 12/11/07 10:54 AM, "Gleb Natapov"  wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>>   I did a rewrite of matching code in OB1. I made it much  
> >>>>>>> simpler and 2
> >>>>>>> times smaller (which is good, less code - less bugs). I also  
> >>>>>>> got rid
> >>>>>>> of huge macros - very helpful if you need to debug something.  
> >>>>>>> There
> >>>>>>> is no performance degradation, actually I even see very small  
> >>>>>>> performance
> >>>>>>> improvement. I ran MTT with this patc

Re: [OMPI devel] New BTL parameter

2007-12-12 Thread Gleb Natapov
On Wed, Dec 12, 2007 at 02:03:02PM -0500, Jeff Squyres wrote:
> On Dec 9, 2007, at 10:34 AM, Gleb Natapov wrote:
> 
> >  Currently BTL has parameter btl_min_send_size that is no longer used.
> > I want to change it to be btl_rndv_eager_limit. This new parameter  
> > will
> > determine a size of a first fragment of rendezvous protocol. Now we  
> > use
> > btl_eager_limit to set its size. btl_rndv_eager_limit will have to be
> > smaller or equal to btl_eager_limit. By default it will be equal to
> > btl_eager_limit so no behavior change will be observed if default is
> > used.
> 
> 
> Can you describe why it would be better to have the value less than  
> the eager limit?
> 
It is just one more knob to tune the OB1 algorithm. Sometimes I don't want
to send any data by copy in/out at all. That is not possible right now.
With this new param I will be able to control this.

--
Gleb.


Re: [OMPI devel] SCTP BTL exclusivity value problem

2007-12-12 Thread Gleb Natapov
On Wed, Dec 12, 2007 at 10:31:37AM -0500, Jeff Squyres wrote:
> I'd be in favor of setting the TCP exclusivity to LOW+100 and setting  
> SCTP exclusivity to LOW.
Fine with me.

> 
> 
> On Dec 12, 2007, at 10:07 AM, Gleb Natapov wrote:
> 
> > On Wed, Dec 12, 2007 at 10:02:07AM -0500, Jeff Squyres wrote:
> >> Yes -- this came up in a prior thread.  See what I proposed:
> >>
> >> http://www.open-mpi.org/community/lists/devel/2007/12/2698.php
> >>
> >> (no one replied, so no action was taken)
> >>
> >> Are you on a system where the SCTP BTL is being built?  What kind of
> >> environment is it?
> > Red Hat Enterprise Linux AS release 4 (Nahant Update 5)
> >
> > # rpm -qa | grep sctp
> > lksctp-tools-devel-1.0.2-6.4E.1
> > lksctp-tools-doc-1.0.2-6.4E.1
> > lksctp-tools-1.0.2-6.4E.1
> >
> >>
> >>
> >>
> >> On Dec 12, 2007, at 9:38 AM, Gleb Natapov wrote:
> >>
> >>> Hi,
> >>>
> >>> SCTP BTL sets its exclusivity value to MCA_BTL_EXCLUSIVITY_LOW - 1
> >>> but MCA_BTL_EXCLUSIVITY_LOW is zero so actually it is set to max
> >>> exclusivity possible. Can somebody fix this please? May be we should
> >>> not
> >>> define MCA_BTL_EXCLUSIVITY_LOW to zero?
> >>>
> >>> --
> >>>   Gleb.
> >>> ___
> >>> devel mailing list
> >>> de...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>
> >>
> >> -- 
> >> Jeff Squyres
> >> Cisco Systems
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
> > --
> > Gleb.
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> Cisco Systems
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


Re: [OMPI devel] SCTP BTL exclusivity value problem

2007-12-12 Thread Gleb Natapov
On Wed, Dec 12, 2007 at 10:02:07AM -0500, Jeff Squyres wrote:
> Yes -- this came up in a prior thread.  See what I proposed:
> 
>  http://www.open-mpi.org/community/lists/devel/2007/12/2698.php
> 
> (no one replied, so no action was taken)
> 
> Are you on a system where the SCTP BTL is being built?  What kind of  
> environment is it?
Red Hat Enterprise Linux AS release 4 (Nahant Update 5)

# rpm -qa | grep sctp
lksctp-tools-devel-1.0.2-6.4E.1
lksctp-tools-doc-1.0.2-6.4E.1
lksctp-tools-1.0.2-6.4E.1

> 
> 
> 
> On Dec 12, 2007, at 9:38 AM, Gleb Natapov wrote:
> 
> > Hi,
> >
> >  SCTP BTL sets its exclusivity value to MCA_BTL_EXCLUSIVITY_LOW - 1
> > but MCA_BTL_EXCLUSIVITY_LOW is zero so actually it is set to max
> > exclusivity possible. Can somebody fix this please? May be we should  
> > not
> > define MCA_BTL_EXCLUSIVITY_LOW to zero?
> >
> > --
> > Gleb.
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> Cisco Systems
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


[OMPI devel] SCTP BTL exclusivity value problem

2007-12-12 Thread Gleb Natapov
Hi,

  SCTP BTL sets its exclusivity value to MCA_BTL_EXCLUSIVITY_LOW - 1
but MCA_BTL_EXCLUSIVITY_LOW is zero so actually it is set to max
exclusivity possible. Can somebody fix this please? May be we should not
define MCA_BTL_EXCLUSIVITY_LOW to zero?

--
Gleb.


Re: [OMPI devel] [PATCH] openib: clean-up connect to allow for new cm's

2007-12-12 Thread Gleb Natapov
On Wed, Dec 12, 2007 at 04:08:31PM +0200, Pavel Shamis (Pasha) wrote:
> Gleb Natapov wrote:
>> On Wed, Dec 12, 2007 at 03:37:26PM +0200, Pavel Shamis (Pasha) wrote:
>>   
>>> Gleb Natapov wrote:
>>> 
>>>> On Tue, Dec 11, 2007 at 08:16:07PM -0500, Jeff Squyres wrote:
>>>> 
>>>>> Isn't there a better way somehow?  Perhaps we should have "select"  
>>>>> call *all* the functions and accept back a priority.  The one with the  
>>>>> highest priority then wins.  This is quite similar to much of the  
>>>>> other selection logic in OMPI.
>>>>>
>>>>> Sidenote: Keep in mind that there are some changes coming to select  
>>>>> CPCs on a per-endpoint basis (I can't look up the trac ticket right  
>>>>> now...).  This makes things a little complicated -- do we need  
>>>>> btl_openib_cpc_include and btl_openib_cpc_exclude MCA params to  
>>>>> include/exclude CPCs (because you might need more than one CPC in a  
>>>>> single job)?  That wouldn't be hard to do.
>>>>>
>>>>> But then what to do about if someone sets to use some XRC QPs and  
>>>>> selects to use OOB or RDMA CM?  How do we catch this and print an  
>>>>> error?  It doesn't seem right to put the "if num_xrc_qps>0" check in  
>>>>> every CPC.  What happens if you try to make an XRC QP when not using  
>>>>> xoob?  Where is the error detected and what kind of error message do  
>>>>> we print?
>>>>>
>>>>> 
>>>> In my opinion "X" notation for QP specification should be removed. I
>>>> didn't want this to prevent XRC merging so I haven't raced this point.
>>>> It is enough to have two types of QPs "P" - SW credit management "S" -
>>>> HW credit management.   
>>> How will you decide witch QP type to use ? (SRQ or XRC)
>>>
>>> 
>> If both sides support XOOB and priority of XOOB is higher then all other 
>> CPC
>> then create XRC, otherwise use regular RC.
>>   
> If some body have connectX hca but  he want to use SRQ and not XRC ?
This will be the default (the priority of OOB will be higher than that of
XOOB), but if users want to use XRC they can raise the XOOB priority by
specifying an MCA parameter.

> I guess anyway we will be need some additional parameter that will allow 
> enable/disable XRC, correct ? (So why just not leave the X qp type ?)
Because we want to support mixed setups and create XRC between nodes that
support it and RC between all other nodes.

--
Gleb.


Re: [OMPI devel] [PATCH] openib: clean-up connect to allow for new cm's

2007-12-12 Thread Gleb Natapov
On Wed, Dec 12, 2007 at 03:37:26PM +0200, Pavel Shamis (Pasha) wrote:
> Gleb Natapov wrote:
> > On Tue, Dec 11, 2007 at 08:16:07PM -0500, Jeff Squyres wrote:
> >   
> >> Isn't there a better way somehow?  Perhaps we should have "select"  
> >> call *all* the functions and accept back a priority.  The one with the  
> >> highest priority then wins.  This is quite similar to much of the  
> >> other selection logic in OMPI.
> >>
> >> Sidenote: Keep in mind that there are some changes coming to select  
> >> CPCs on a per-endpoint basis (I can't look up the trac ticket right  
> >> now...).  This makes things a little complicated -- do we need  
> >> btl_openib_cpc_include and btl_openib_cpc_exclude MCA params to  
> >> include/exclude CPCs (because you might need more than one CPC in a  
> >> single job)?  That wouldn't be hard to do.
> >>
> >> But then what to do about if someone sets to use some XRC QPs and  
> >> selects to use OOB or RDMA CM?  How do we catch this and print an  
> >> error?  It doesn't seem right to put the "if num_xrc_qps>0" check in  
> >> every CPC.  What happens if you try to make an XRC QP when not using  
> >> xoob?  Where is the error detected and what kind of error message do  
> >> we print?
> >>
> >> 
> > In my opinion "X" notation for QP specification should be removed. I
> > didn't want this to prevent XRC merging so I haven't raced this point.
> > It is enough to have two types of QPs "P" - SW credit management "S" -
> > HW credit management. 
> How will you decide witch QP type to use ? (SRQ or XRC)
> 
If both sides support XOOB and the priority of XOOB is higher than that of all
other CPCs, then we create XRC; otherwise we use regular RC.

--
Gleb.


Re: [OMPI devel] [PATCH] openib: clean-up connect to allow for new cm's

2007-12-12 Thread Gleb Natapov
On Tue, Dec 11, 2007 at 08:16:07PM -0500, Jeff Squyres wrote:
> Isn't there a better way somehow?  Perhaps we should have "select"  
> call *all* the functions and accept back a priority.  The one with the  
> highest priority then wins.  This is quite similar to much of the  
> other selection logic in OMPI.
> 
> Sidenote: Keep in mind that there are some changes coming to select  
> CPCs on a per-endpoint basis (I can't look up the trac ticket right  
> now...).  This makes things a little complicated -- do we need  
> btl_openib_cpc_include and btl_openib_cpc_exclude MCA params to  
> include/exclude CPCs (because you might need more than one CPC in a  
> single job)?  That wouldn't be hard to do.
> 
> But then what to do about if someone sets to use some XRC QPs and  
> selects to use OOB or RDMA CM?  How do we catch this and print an  
> error?  It doesn't seem right to put the "if num_xrc_qps>0" check in  
> every CPC.  What happens if you try to make an XRC QP when not using  
> xoob?  Where is the error detected and what kind of error message do  
> we print?
> 
In my opinion "X" notation for QP specification should be removed. I
didn't want this to prevent XRC merging so I haven't raced this point.
It is enough to have two types of QPs "P" - SW credit management "S" -
HW credit management. I think connection management should work like
this: Each BTL knows what type of CPC it can use and it should share
this info during modex stage. During connection establishment modex info
is used to figure out the list of CPCs that both endpoints support and one
with highest prio is selected.
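
To make the idea concrete, here is a rough sketch of the selection rule I
have in mind (illustrative C with made-up names, not the actual openib CPC
code):

#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;      /* CPC name as published in the modex */
    int         priority;  /* locally configured priority of this CPC */
} cpc_info_t;

/* return the highest-priority CPC supported by both endpoints, or NULL */
static const cpc_info_t *select_cpc(const cpc_info_t *local,  size_t nlocal,
                                    const cpc_info_t *remote, size_t nremote)
{
    const cpc_info_t *best = NULL;
    for (size_t i = 0; i < nlocal; i++) {
        for (size_t j = 0; j < nremote; j++) {
            if (0 == strcmp(local[i].name, remote[j].name) &&
                (NULL == best || local[i].priority > best->priority)) {
                best = &local[i];
            }
        }
    }
    return best;
}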

--
Gleb.


Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Gleb Natapov
On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
> Try UD, frags are reordered at a very high rate so should be a good test.
mpi-ping works fine with the UD BTL and the patch.

> 
> Andrew
> 
> Richard Graham wrote:
> > Gleb,
> >   I would suggest that before this is checked in this be tested on a system
> > that has N-way network parallelism, where N is as large as you can find.
> > This is a key bit of code for MPI correctness, and out-of-order operations
> > will break it, so you want to maximize the chance for such operations.
> > 
> > Rich
> > 
> > 
> > On 12/11/07 10:54 AM, "Gleb Natapov"  wrote:
> > 
> >> Hi,
> >>
> >>I did a rewrite of matching code in OB1. I made it much simpler and 2
> >> times smaller (which is good, less code - less bugs). I also got rid
> >> of huge macros - very helpful if you need to debug something. There
> >> is no performance degradation, actually I even see very small performance
> >> improvement. I ran MTT with this patch and the result is the same as on
> >> trunk. I would like to commit this to the trunk. The patch is attached
> >> for everybody to try.

--
Gleb.


Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Gleb Natapov
On Tue, Dec 11, 2007 at 08:36:42AM -0800, Andrew Friedley wrote:
> Possibly, though I have results from a benchmark I've written indicating 
> the reordering happens at the sender.  I believe I found it was due to 
> the QP striping trick I use to get more bandwidth -- if you back down to 
> one QP (there's a define in the code you can change), the reordering 
> rate drops.
Ah, OK. My assumption came just from looking at the code, so I may be wrong.

> 
> Also I do not make any recursive calls to progress -- at least not 
> directly in the BTL; I can't speak for the upper layers.  The reason I 
> do many completions at once is that it is a big help in turning around 
> receive buffers, making it harder to run out of buffers and drop frags. 
>   I want to say there was some performance benefit as well but I can't 
> say for sure.
Currently the upper layers of Open MPI may call the BTL progress function
recursively. I hope this will change some day.

> 
> Andrew
> 
> Gleb Natapov wrote:
> > On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
> >> Try UD, frags are reordered at a very high rate so should be a good test.
> > Good Idea I'll try this. BTW I thing the reason for such a high rate of
> > reordering in UD is that it polls for MCA_BTL_UD_NUM_WC completions
> > (500) and process them one by one and if progress function is called
> > recursively next 500 completion will be reordered versus previous
> > completions (reordering happens on a receiver, not sender).
> > 
> >> Andrew
> >>
> >> Richard Graham wrote:
> >>> Gleb,
> >>>   I would suggest that before this is checked in this be tested on a 
> >>> system
> >>> that has N-way network parallelism, where N is as large as you can find.
> >>> This is a key bit of code for MPI correctness, and out-of-order operations
> >>> will break it, so you want to maximize the chance for such operations.
> >>>
> >>> Rich
> >>>
> >>>
> >>> On 12/11/07 10:54 AM, "Gleb Natapov"  wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>>I did a rewrite of matching code in OB1. I made it much simpler and 2
> >>>> times smaller (which is good, less code - less bugs). I also got rid
> >>>> of huge macros - very helpful if you need to debug something. There
> >>>> is no performance degradation, actually I even see very small performance
> >>>> improvement. I ran MTT with this patch and the result is the same as on
> >>>> trunk. I would like to commit this to the trunk. The patch is attached
> >>>> for everybody to try.
> >>>>
> >>>> --
> >>>> Gleb.
> >>>> ___
> >>>> devel mailing list
> >>>> de...@open-mpi.org
> >>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>> ___
> >>> devel mailing list
> >>> de...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > 
> > --
> > Gleb.
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Gleb Natapov
On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
> Try UD, frags are reordered at a very high rate so should be a good test.
Good idea, I'll try this. BTW I think the reason for such a high rate of
reordering in UD is that it polls for MCA_BTL_UD_NUM_WC completions (500) and
processes them one by one; if the progress function is called recursively, the
next 500 completions will be reordered with respect to the previous ones (the
reordering happens on the receiver, not the sender).
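
Roughly what I mean, as an illustrative sketch (made-up helper names and
types, not the actual UD BTL code):

/* illustrative stand-ins for the real completion type and helpers */
typedef struct { unsigned long wr_id; } completion_t;
extern int  poll_cq(completion_t *wc, int max);  /* returns #completions */
extern void deliver_to_pml(completion_t *wc);    /* may re-enter progress */

#define NUM_WC 500

static void ud_progress_sketch(void)
{
    completion_t wc[NUM_WC];
    int n = poll_cq(wc, NUM_WC);
    for (int i = 0; i < n; i++) {
        /* if deliver_to_pml() recursively calls ud_progress_sketch(), the
         * next batch of up to 500 completions is consumed before
         * wc[i+1 .. n-1], so the receiver sees fragments out of order */
        deliver_to_pml(&wc[i]);
    }
}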

> 
> Andrew
> 
> Richard Graham wrote:
> > Gleb,
> >   I would suggest that before this is checked in this be tested on a system
> > that has N-way network parallelism, where N is as large as you can find.
> > This is a key bit of code for MPI correctness, and out-of-order operations
> > will break it, so you want to maximize the chance for such operations.
> > 
> > Rich
> > 
> > 
> > On 12/11/07 10:54 AM, "Gleb Natapov"  wrote:
> > 
> >> Hi,
> >>
> >>I did a rewrite of matching code in OB1. I made it much simpler and 2
> >> times smaller (which is good, less code - less bugs). I also got rid
> >> of huge macros - very helpful if you need to debug something. There
> >> is no performance degradation, actually I even see very small performance
> >> improvement. I ran MTT with this patch and the result is the same as on
> >> trunk. I would like to commit this to the trunk. The patch is attached
> >> for everybody to try.
> >>
> >> --
> >> Gleb.
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > 
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Gleb Natapov
On Tue, Dec 11, 2007 at 10:00:08AM -0600, Brian W. Barrett wrote:
> On Tue, 11 Dec 2007, Gleb Natapov wrote:
> 
> >   I did a rewrite of matching code in OB1. I made it much simpler and 2
> > times smaller (which is good, less code - less bugs). I also got rid
> > of huge macros - very helpful if you need to debug something. There
> > is no performance degradation, actually I even see very small performance
> > improvement. I ran MTT with this patch and the result is the same as on
> > trunk. I would like to commit this to the trunk. The patch is attached
> > for everybody to try.
> 
> I don't think we can live without those macros :).  Out of curiousity, is 
> there any functionality that was removed as a result of this change?
No. The way out-of-order packets are handled changed a little bit, but they
are still handled in the correct order.

> 
> I'll test on a couple systems over the next couple of days...
> 
Thanks!

--
Gleb.


Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Gleb Natapov
On Tue, Dec 11, 2007 at 11:00:51AM -0500, Richard Graham wrote:
> Gleb,
>   I would suggest that before this is checked in this be tested on a system
> that has N-way network parallelism, where N is as large as you can find.
> This is a key bit of code for MPI correctness, and out-of-order operations
> will break it, so you want to maximize the chance for such operations.
> 
I started this rewrite while chasing this bug:
https://svn.open-mpi.org/trac/ompi/ticket/1158.
As you can see, the openib BTL unfortunately reorders fragments quite a bit :(
Of course, no amount of testing is enough for such an important piece of code.

--
Gleb.


[OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Gleb Natapov
Hi,

   I did a rewrite of the matching code in OB1. I made it much simpler and two
times smaller (which is good: less code, fewer bugs). I also got rid of the
huge macros - very helpful if you need to debug something. There is no
performance degradation; I actually even see a very small performance
improvement. I ran MTT with this patch and the result is the same as on the
trunk. I would like to commit this to the trunk. The patch is attached for
everybody to try.

--
Gleb.
diff --git a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
index d3f7c37..299ae9e 100644
--- a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
+++ b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
@@ -184,244 +184,159 @@ void mca_pml_ob1_recv_frag_callback( mca_btl_base_module_t* btl,
 }
 }

-/**
- * Try and match the incoming message fragment to a generic
- * list of receives
- *
- * @param hdr Matching data from received fragment (IN)
- *
- * @param generic_receives Pointer to the receive list used for
- * matching purposes. (IN)
- *
- * @return Matched receive
- *
- * This routine assumes that the appropriate matching locks are
- * set by the upper level routine.
- */
-#define MCA_PML_OB1_MATCH_GENERIC_RECEIVES(hdr,generic_receives,proc,return_match) \
-do {   \
-/* local variables */  \
-mca_pml_ob1_recv_request_t *generic_recv = (mca_pml_ob1_recv_request_t *)  \
- opal_list_get_first(generic_receives);\
-mca_pml_ob1_recv_request_t *last_recv = (mca_pml_ob1_recv_request_t *) \
-opal_list_get_end(generic_receives);   \
-register int recv_tag, frag_tag = hdr->hdr_tag;\
-   \
-/* Loop over the receives. If the received tag is less than zero  */   \
-/* enter in a special mode, where we match only our internal tags */   \
-/* (such as those used by the collectives.*/   \
-if( 0 <= frag_tag ) {  \
-for( ; generic_recv != last_recv;  \
- generic_recv = (mca_pml_ob1_recv_request_t *) \
- ((opal_list_item_t *)generic_recv)->opal_list_next) { \
-/* Check for a match */\
-recv_tag = generic_recv->req_recv.req_base.req_tag;\
-if ( (frag_tag == recv_tag) || (recv_tag == OMPI_ANY_TAG) ) {  \
-break; \
-}  \
-}  \
-} else {   \
-for( ; generic_recv != last_recv;  \
- generic_recv = (mca_pml_ob1_recv_request_t *) \
- ((opal_list_item_t *)generic_recv)->opal_list_next) { \
-/* Check for a match */\
-recv_tag = generic_recv->req_recv.req_base.req_tag;\
-if( OPAL_UNLIKELY(frag_tag == recv_tag) ) {\
-break; \
-}  \
-}  \
-}  \
-if( generic_recv != (mca_pml_ob1_recv_request_t *) \
-opal_list_get_end(generic_receives) ) {\
-   \
-/* Match made */   \
-return_match = generic_recv;   \
-   \
-/* remove descriptor from posted specific ireceive list */ \
-opal_list_remove_item(generic_receives,\
-  (opal_list_item_t *)generic_recv);   \
-PERUSE_TRACE_COMM_EVENT (PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q, \
- &(generic_recv->req_recv.req_base),   \
- P

Re: [OMPI devel] opal_condition_wait

2007-12-11 Thread Gleb Natapov
On Tue, Dec 11, 2007 at 10:27:55AM -0500, Tim Prins wrote:
> My understanding was that this behavior was not right, but upon further 
> inspection of the pthreads documentation this behavior seems to be 
> allowable.
> 
I think that Open MPI does not implement condition variables in the strict
sense. An Open MPI condition variable has to progress devices and wait for the
condition simultaneously, not just wait until the condition is satisfied.

--
Gleb.


[OMPI devel] New BTL parameter

2007-12-09 Thread Gleb Natapov
Hi,

  Currently the BTL has a parameter, btl_min_send_size, that is no longer
used. I want to change it to btl_rndv_eager_limit. This new parameter will
determine the size of the first fragment of the rendezvous protocol. Right now
we use btl_eager_limit to set its size. btl_rndv_eager_limit will have to be
smaller than or equal to btl_eager_limit. By default it will be equal to
btl_eager_limit, so no behavior change will be observed if the default is
used.
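
As a hedged sketch of the intended constraint (illustrative code, not the
actual MCA registration logic):

#include <stddef.h>

/* clamp the new parameter so it never exceeds btl_eager_limit, and default
 * it to btl_eager_limit so the out-of-the-box behavior is unchanged */
static size_t resolve_rndv_eager_limit(size_t eager_limit,
                                       size_t rndv_eager_limit)
{
    if (0 == rndv_eager_limit || rndv_eager_limit > eager_limit) {
        return eager_limit;
    }
    return rndv_eager_limit;
}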

--
Gleb.


[OMPI devel] Changes to all BTLs.

2007-12-09 Thread Gleb Natapov
Hi everybody,

 I committed changes to the BTL interface. Two new parameters are now
provided to descriptor allocation: endpoint and flags. I did my best to update
all in-tree BTLs, but I can't compile all of them, so compilation problems are
possible. Can everybody check that the BTLs they care about still compile?
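
For reference, the allocation hook now looks roughly like this (shape written
from memory, so treat it as an approximation and check btl.h for the
authoritative prototype):

#include <stddef.h>
#include <stdint.h>

/* forward declarations only; the real definitions live in btl.h */
struct mca_btl_base_module_t;
struct mca_btl_base_endpoint_t;
struct mca_btl_base_descriptor_t;

typedef struct mca_btl_base_descriptor_t *(*mca_btl_base_module_alloc_fn_t)(
    struct mca_btl_base_module_t   *btl,
    struct mca_btl_base_endpoint_t *endpoint,  /* new parameter */
    uint8_t                         order,
    size_t                          size,
    uint32_t                        flags);    /* new parameter */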

--
Gleb.


Re: [OMPI devel] 32-bit openib is broken on the trunk as of Nov 27th, r16799

2007-12-09 Thread Gleb Natapov
On Wed, Dec 05, 2007 at 02:45:17PM -0500, Tim Mattox wrote:
> Hello,
> It appears that sometime after r16777, and by r16799, that something
> was broken on the trunk's openib support for 32-bit builds.
> The 64-bit tests all seem normal, as well as the 32-bit & 64-bit tests on
> the 1.2 branch on the same machine (odin).
> 
> See this MTT results page permalink showing the 32-bit odin runs:
> http://www.open-mpi.org/mtt/index.php?do_redir=468
> 
> Pasha & Gleb, you both did a variety of checkins in that svn r# range.
> Do either of you have time to investigate this?
> 
> Here is a snippet from one randomly picked failed test (out of thousands):
> [1,1][btl_openib_component.c:1665:btl_openib_module_progress] from
> odin001 to: odin001 error
> polling LP CQ with status LOCAL PROTOCOL ERROR status number 4 for
> wr_id 141733120 opcode 128
> qp_idx 3
> --
> mpirun has exited due to process rank 1 with PID 29761 on
> node odin001 calling "abort". This will have caused other processes
> in the application to be terminated by signals sent by mpirun
> (as reported here).
> --
> 
> Thanks, and happy bug hunting!
I know where the problem is. Will fix this week.
--
Gleb.


Re: [OMPI devel] opal_condition_wait

2007-12-06 Thread Gleb Natapov
On Thu, Dec 06, 2007 at 09:46:45AM -0500, Tim Prins wrote:
> Also, when we are using threads, there is a case where we do not 
> decrement the signaled count, in condition.h:84. Gleb put this in in 
> r9451, however the change does not make sense to me. I think that the 
> signal count should always be decremented.
> 
> Can anyone shine any light on these issues?
> 
I made this change a long time ago (I wonder why I even tested the threaded
build back then), but from what I recall from the code and the log message,
there was a deadlock when a signal broadcast didn't wake up all of the threads
waiting on a condition variable. Suppose two threads wait on a condition C and
a third thread does a broadcast. This makes C->c_signaled equal to 2. Now one
thread wakes up and decrements C->c_signaled by one. Before the other thread
starts to run, the first one calls condition_wait on C one more time. Because
c_signaled is 1 it doesn't sleep and decrements c_signaled one more time. Now
c_signaled is zero, and when the second thread wakes up it sees this and goes
back to sleep. The solution was to make condition_wait check whether the
condition is already signaled before going to sleep and, if so, return
immediately.
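
A minimal sketch of that early-exit check (simplified, made-up struct, not the
actual opal_condition_wait code; whether c_signaled should also be decremented
on the early-exit path is exactly the question raised above):

struct cond_sketch {
    volatile int c_signaled;   /* pending signals from signal/broadcast */
    volatile int c_waiting;    /* threads currently blocked in wait */
};

static void cond_wait_sketch(struct cond_sketch *c)
{
    if (0 != c->c_signaled) {  /* already signaled: return without sleeping */
        return;
    }
    c->c_waiting++;
    while (0 == c->c_signaled) {
        ;                      /* the real code progresses devices here */
    }
    c->c_signaled--;
    c->c_waiting--;
}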

--
Gleb.


Re: [OMPI devel] tmp XRC branches

2007-11-30 Thread Gleb Natapov
On Fri, Nov 30, 2007 at 02:06:02PM -0500, Jeff Squyres wrote:
> Are any of the XRC tmp SVN branches still relevant?  Or have they now  
> been integrated into the trunk?
> 
> I ask because I see 4 XRC-related branches out there under /tmp and / 
> tmp-public.
They are not relevant any more. I'll remove the one I created.

--
Gleb.


Re: [OMPI devel] THREAD_MULTIPLE

2007-11-28 Thread Gleb Natapov
On Wed, Nov 28, 2007 at 01:46:53PM -0500, George Bosilca wrote:
> Yes, "us" means UTK. Our math folks are pushing hard for this. I'll gladly 
> accept any help, even if it's only for testing. For development, I dispose 
> of some of my time and a 100% of a post-doc for few months.
I have already worked on this for some time and I can spend more time on it.
I am mainly interested in working on the PML/BTL, but there are other parts of
MPI that are not related to communication and still need to be made thread
safe.

>
> However, there are limits to what we can do. We will make sure the BTL 
> threading requirements are clearly specified, and we will take care of the 
> BTLs we already worked on (TCP, self, SM, MX). I hope that once the BTL 
> interface is defined, others can make sure their BTL follow the guidelines.
>
>   Thanks,
> george.
>
> On Nov 28, 2007, at 1:34 PM, Jeff Squyres wrote:
>
>> On Nov 28, 2007, at 1:26 PM, George Bosilca wrote:
>>
>>> There is a priority change for us.
>>
>> "us" = UTK?
>>
>>> It's definitively time to have a fully supported MPI_THREAD_MULTIPLE
>>> mode in Open MPI. I'm working to figure out how and where to get the
>>> cycles for this. I expect to start working on it in January. So, the
>>> good news is that 1.3 will have thread support.
>>
>> That will be great.  Do you really think that you can finish the
>> THREAD_MULTIPLE work by yourself?
>>
>> Cisco can provide some resources for testing (in the environments that
>> we care about :-) ), but probably not for development.
>>
>> -- 
>> Jeff Squyres
>> Cisco Systems
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


Re: [OMPI devel] IB/OpenFabrics pow wow

2007-11-19 Thread Gleb Natapov
On Fri, Nov 16, 2007 at 11:36:39AM -0800, Jeff Squyres wrote:
> 1. Mon, 26 Nov, 10am US East, 7am US Pacific, 5pm Israel
> 2. Mon, 26 Nov, 11am US East, 8am US Pacific, 6pm Israel
> 3. Thu, 29 Nov, 10am US East, 7am US Pacific, 5pm Israel
> 4. Thu, 29 Nov, 11am US East, 8am US Pacific, 6pm Israel
> 5. Fri, 30 Nov, 10am US East, 7am US Pacific, 5pm Israel
> 6. Fri, 30 Nov, 11am US East, 8am US Pacific, 6pm Israel
> 
1,2,3 or 4 are OK with me. Friday is not a working day in Israel.

--
Gleb.


Re: [OMPI devel] [OMPI svn] svn:open-mpi r16723

2007-11-14 Thread Gleb Natapov
On Wed, Nov 14, 2007 at 06:44:06AM -0800, Tim Prins wrote:
> Hi,
> 
> The following files bother me about this commit:
>  trunk/ompi/mca/btl/sctp/sctp_writev.c
>  trunk/ompi/mca/btl/sctp/sctp_writev.h
> 
> They bother me for 2 reasons:
> 1. Their naming does not follow the prefix rule
> 2. They are LGPL licensed. While I personally like the LGPL, I do not 
> believe it is compatible with the BSD license that OMPI is distributed 
> under. I think (though I could be wrong) that these files need to be 
> removed from the repository and the functionality implemented in some 
> other way.

Can a function that fills in a couple of struct fields really be reimplemented
in any other way? :)

> 
> Tim
> 
> 
> pen...@osl.iu.edu wrote:
> > Author: penoff
> > Date: 2007-11-13 18:39:16 EST (Tue, 13 Nov 2007)
> > New Revision: 16723
> > URL: https://svn.open-mpi.org/trac/ompi/changeset/16723
> > 
> > Log:
> > initial SCTP BTL commit
> > Added:
> >trunk/ompi/mca/btl/sctp/
> >trunk/ompi/mca/btl/sctp/.ompi_ignore
> >trunk/ompi/mca/btl/sctp/.ompi_unignore
> >trunk/ompi/mca/btl/sctp/Makefile.am
> >trunk/ompi/mca/btl/sctp/btl_sctp.c
> >trunk/ompi/mca/btl/sctp/btl_sctp.h
> >trunk/ompi/mca/btl/sctp/btl_sctp_addr.h
> >trunk/ompi/mca/btl/sctp/btl_sctp_component.c
> >trunk/ompi/mca/btl/sctp/btl_sctp_component.h
> >trunk/ompi/mca/btl/sctp/btl_sctp_endpoint.c
> >trunk/ompi/mca/btl/sctp/btl_sctp_endpoint.h
> >trunk/ompi/mca/btl/sctp/btl_sctp_frag.c
> >trunk/ompi/mca/btl/sctp/btl_sctp_frag.h
> >trunk/ompi/mca/btl/sctp/btl_sctp_hdr.h
> >trunk/ompi/mca/btl/sctp/btl_sctp_proc.c
> >trunk/ompi/mca/btl/sctp/btl_sctp_proc.h
> >trunk/ompi/mca/btl/sctp/btl_sctp_recv_handler.c
> >trunk/ompi/mca/btl/sctp/btl_sctp_recv_handler.h
> >trunk/ompi/mca/btl/sctp/btl_sctp_utils.c
> >trunk/ompi/mca/btl/sctp/btl_sctp_utils.h
> >trunk/ompi/mca/btl/sctp/configure.m4
> >trunk/ompi/mca/btl/sctp/configure.params
> >trunk/ompi/mca/btl/sctp/sctp_writev.c
> >trunk/ompi/mca/btl/sctp/sctp_writev.h
> > 
> > 
> > Diff not shown due to size (201438 bytes).
> > To see the diff, run the following command:
> > 
> > svn diff -r 16722:16723 --no-diff-deleted
> > 
> > ___
> > svn mailing list
> > s...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/svn
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


Re: [OMPI devel] Multi-Rail and Open IB BTL

2007-11-14 Thread Gleb Natapov
Sorry, I missed the mail with the question.

On Mon, Nov 12, 2007 at 06:03:07AM -0500, Jeff Squyres wrote:
> On Nov 9, 2007, at 1:24 PM, Don Kerr wrote:
> 
> > both, I was thinking of listing what I think are multi-rail  
> > requirements
> > but wanted to understand what the current state of things are
> 
> I believe the OF portion of the FAQ describes what we do in the v1.2  
> series (right Gleb?); I honestly don't remember what we do today on  
> the trunk (I'm pretty sure that Gleb has tweaked it recently).
I haven't tweaked anything related to this recently. If one host has two
ports and the other has one port, only one connection is established between
them.

--
Gleb.


Re: [OMPI devel] collective problems

2007-11-08 Thread Gleb Natapov
On Wed, Nov 07, 2007 at 11:25:43PM -0500, Patrick Geoffray wrote:
> Richard Graham wrote:
> > The real problem, as you and others have pointed out is the lack of
> > predictable time slices for the progress engine to do its work, when relying
> > on the ULP to make calls into the library...
> 
> The real, real problem is that the BTL should handle progression at 
> their level, specially when the buffering is due to BTL-level flow 
> control. When I write something into a socket, TCP will take care of 
> sending it eventually, for example.
In the case of TCP the kernel is kind enough to progress the message for you,
but only if there is enough space in the kernel's internal buffers. If there
is no room there, the TCP BTL will also buffer messages in user space and
will, eventually, have the same problem.

To progress such outstanding messages an additional thread is needed in user
space. Is this what MX does?

--
Gleb.


Re: [OMPI devel] collective problems

2007-11-08 Thread Gleb Natapov
On Wed, Nov 07, 2007 at 01:16:04PM -0500, George Bosilca wrote:
>
> On Nov 7, 2007, at 12:51 PM, Jeff Squyres wrote:
>
>>> The same callback is called in both cases. In the case that you
>>> described, the callback is called just a little bit deeper into the
>>> recursion, when in the "normal case" it will get called from the
>>> first level of the recursion. Or maybe I miss something here ...
>>
>> Right -- it's not the callback that is the problem.  It's when the
>> recursion is unwound and further up the stack you now have a stale
>> request.
>
> That's exactly the point that I fail to see. If the request is freed in the 
> PML callback, then it should get release in both cases, and therefore lead 
> to problems all the time. Which, obviously, is not true when we do not have 
> this deep recursion thing going on.
>
> Moreover, he request management is based on the reference count. The PML 
> level have one ref count and the MPI level have another one. In fact, we 
> cannot release a request until we explicitly call ompi_request_free on it. 
> The place where this call happens is different between the blocking and non 
> blocking calls. In the non blocking case the ompi_request_free get called 
> from the *_test (*_wait) functions while in the blocking case it get called 
> directly from the MPI_Send function.
>
> Let me summarize: a request cannot reach a stale state without a call to 
> ompi_request_free. This function is never called directly from the PML 
> level. Therefore, the recursion depth should not have any impact on the 
> state of the request !

I looked at the code one more time and it seems to me now that George is
absolutely right. The scenario I described cannot happen because we call
ompi_request_free() at the top of the stack. I somehow had the impression that
we mark internal requests as freed before calling send(). So I'll go and
implement the NOT_ON_WIRE extension when I have time for it.

--
Gleb.


Re: [OMPI devel] collective problems

2007-11-08 Thread Gleb Natapov
On Wed, Nov 07, 2007 at 09:07:23PM -0700, Brian Barrett wrote:
> Personally, I'd rather just not mark MPI completion until a local  
> completion callback from the BTL.  But others don't like that idea, so  
> we came up with a way for back pressure from the BTL to say "it's not  
> on the wire yet".  This is more complicated than just not marking MPI  
> completion early, but why would we do something that helps real apps  
> at the expense of benchmarks?  That would just be silly!
> 
I fully agree with Brian here. Trying to solve the issue with the current
approach will introduce additional checking in the fast path and will only
hurt real apps.

> Brian
> 
> On Nov 7, 2007, at 7:56 PM, Richard Graham wrote:
> 
> > Does this mean that we don’t have a queue to store btl level  
> > descriptors that
> >  are only partially complete ?  Do we do an all or nothing with  
> > respect to btl
> >  level requests at this stage ?
> >
> > Seems to me like we want to mark things complete at the MPI level  
> > ASAP, and
> >  that this proposal is not to do that – is this correct ?
> >
> > Rich
> >
> >
> > On 11/7/07 11:26 PM, "Jeff Squyres"  wrote:
> >
> >> On Nov 7, 2007, at 9:33 PM, Patrick Geoffray wrote:
> >>
> >> >> Remember that this is all in the context of Galen's proposal for
> >> >> btl_send() to be able to return NOT_ON_WIRE -- meaning that the  
> >> send
> >> >> was successful, but it has not yet been sent (e.g., openib BTL
> >> >> buffered it because it ran out of credits).
> >> >
> >> > Sorry if I miss something obvious, but why does the PML has to be
> >> > aware
> >> > of the flow control situation of the BTL ? If the BTL cannot send
> >> > something right away for any reason, it should be the  
> >> responsibility
> >> > of
> >> > the BTL to buffer it and to progress on it later.
> >>
> >>
> >> That's currently the way it is.  But the BTL currently only has the
> >> option to say two things:
> >>
> >> 1. "ok, done!" -- then the PML will think that the request is  
> >> complete
> >> 2. "doh -- error!" -- then the PML thinks that Something Bad
> >> Happened(tm)
> >>
> >> What we really need is for the BTL to have a third option:
> >>
> >> 3. "not done yet!"
> >>
> >> So that the PML knows that the request is not yet done, but will  
> >> allow
> >> other things to progress while we're waiting for it to complete.
> >> Without this, the openib BTL currently replies "ok, done!", even when
> >> it has only buffered a message (rather than actually sending it out).
> >> This optimization works great (yeah, I know...) except for apps that
> >> don't dip into the MPI library frequently.  :-\
> >>
> >> --
> >> Jeff Squyres
> >> Cisco Systems
> >>
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.



Re: [OMPI devel] Multi-Rail and Open IB BTL

2007-11-01 Thread Gleb Natapov
On Thu, Nov 01, 2007 at 11:15:21AM -0400, Don Kerr wrote:
> How would the openib btl handle the following scenario:
> Two nodes, each with two ports, all ports are on the same subnet and switch.
> 
> Would striping occur over 4 connections or 2?
Only two connections will be created.

> 
> If 2 is it equal distribution or are both local ports connected to the 
> same remote port?
> 
Equal distribution.

--
Gleb.


[OMPI devel] bml_btl->btl_alloc() instead of mca_bml_base_alloc() in OSC

2007-10-28 Thread Gleb Natapov
Hi Brian,

  Is there a special reason why you call the BTL functions directly instead
of using the BML wrappers? What about applying this patch?


diff --git a/ompi/mca/osc/rdma/osc_rdma_component.c 
b/ompi/mca/osc/rdma/osc_rdma_component.c
index 2d0dc06..302dd9e 100644
--- a/ompi/mca/osc/rdma/osc_rdma_component.c
+++ b/ompi/mca/osc/rdma/osc_rdma_component.c
@@ -1044,9 +1044,8 @@ rdma_send_info_send(ompi_osc_rdma_module_t *module,
 ompi_osc_rdma_rdma_info_header_t *header = NULL;

 bml_btl = peer_send_info->bml_btl;
-descriptor = bml_btl->btl_alloc(bml_btl->btl,
-MCA_BTL_NO_ORDER,
-sizeof(ompi_osc_rdma_rdma_info_header_t));
+mca_bml_base_alloc(bml_btl, &descriptor, MCA_BTL_NO_ORDER,
+sizeof(ompi_osc_rdma_rdma_info_header_t));
 if (NULL == descriptor) {
 ret = OMPI_ERR_TEMP_OUT_OF_RESOURCE;
 goto cleanup;
diff --git a/ompi/mca/osc/rdma/osc_rdma_data_move.c 
b/ompi/mca/osc/rdma/osc_rdma_data_move.c
index e9fd17c..e7b5813 100644
--- a/ompi/mca/osc/rdma/osc_rdma_data_move.c
+++ b/ompi/mca/osc/rdma/osc_rdma_data_move.c
@@ -454,10 +454,10 @@ ompi_osc_rdma_sendreq_send(ompi_osc_rdma_module_t *module,
 /* get a buffer... */
 endpoint = (mca_bml_base_endpoint_t*) 
sendreq->req_target_proc->proc_bml;
 bml_btl = mca_bml_base_btl_array_get_next(&endpoint->btl_eager);
-descriptor = bml_btl->btl_alloc(bml_btl->btl,
-MCA_BTL_NO_ORDER,
-module->m_use_buffers ? 
bml_btl->btl_eager_limit : needed_len < bml_btl->btl_eager_limit ? needed_len :
-bml_btl->btl_eager_limit);
+mca_bml_base_alloc(bml_btl, &descriptor, MCA_BTL_NO_ORDER,
+module->m_use_buffers ? bml_btl->btl_eager_limit :
+needed_len < bml_btl->btl_eager_limit ? needed_len :
+bml_btl->btl_eager_limit);
 if (NULL == descriptor) {
 ret = OMPI_ERR_TEMP_OUT_OF_RESOURCE;
 goto cleanup;
@@ -698,9 +698,8 @@ ompi_osc_rdma_replyreq_send(ompi_osc_rdma_module_t *module,
 /* Get a BTL and a fragment to go with it */
 endpoint = (mca_bml_base_endpoint_t*) replyreq->rep_origin_proc->proc_bml;
 bml_btl = mca_bml_base_btl_array_get_next(&endpoint->btl_eager);
-descriptor = bml_btl->btl_alloc(bml_btl->btl,
-MCA_BTL_NO_ORDER,
-bml_btl->btl_eager_limit);
+mca_bml_base_alloc(bml_btl, &descriptor, MCA_BTL_NO_ORDER,
+bml_btl->btl_eager_limit);
 if (NULL == descriptor) {
 ret = OMPI_ERR_TEMP_OUT_OF_RESOURCE;
 goto cleanup;
@@ -1260,9 +1259,8 @@ ompi_osc_rdma_control_send(ompi_osc_rdma_module_t *module,
 /* Get a BTL and a fragment to go with it */
 endpoint = (mca_bml_base_endpoint_t*) proc->proc_bml;
 bml_btl = mca_bml_base_btl_array_get_next(&endpoint->btl_eager);
-descriptor = bml_btl->btl_alloc(bml_btl->btl,
-MCA_BTL_NO_ORDER,
-sizeof(ompi_osc_rdma_control_header_t));
+mca_bml_base_alloc(bml_btl, &descriptor, MCA_BTL_NO_ORDER,
+sizeof(ompi_osc_rdma_control_header_t));
 if (NULL == descriptor) {
 ret = OMPI_ERR_TEMP_OUT_OF_RESOURCE;
 goto cleanup;
@@ -1322,9 +1320,8 @@ ompi_osc_rdma_rdma_ack_send(ompi_osc_rdma_module_t 
*module,
 ompi_osc_rdma_control_header_t *header = NULL;

 /* Get a BTL and a fragment to go with it */
-descriptor = bml_btl->btl_alloc(bml_btl->btl,
-rdma_btl->rdma_order,
-sizeof(ompi_osc_rdma_control_header_t));
+mca_bml_base_alloc(bml_btl, &descriptor, rdma_btl->rdma_order,
+sizeof(ompi_osc_rdma_control_header_t));
 if (NULL == descriptor) {
 ret = OMPI_ERR_TEMP_OUT_OF_RESOURCE;
 goto cleanup;
--
Gleb.


Re: [OMPI devel] RFC: Add "connect" field to openib BTL INI file

2007-10-25 Thread Gleb Natapov
On Thu, Oct 25, 2007 at 10:55:25AM -0400, Jeff Squyres wrote:
> On Oct 25, 2007, at 10:35 AM, Gleb Natapov wrote:
> 
> > I don't think xrc should be used by default even if HW supports it.  
> > Only if
> > special config option is set xrc should be attempted.
> 
> Why?

XRC is a crippled RC protocol for scalability's sake. Using it makes the
progress of one process depend on the behaviour of other processes on the same
node, which may cause various interesting effects. And of course SW flow
control is not possible when using XRC, so for small jobs it will actually be
slower. I don't think it is wise to use XRC over regular RC if there is a
choice.

> 
> > And xrc availability
> > can be tested in runtime without additional options in ini file.
> 
> Is there a flag on the device / port that indicates XRC availability?
XRC requires the creation of a special kind of QP. If creating one fails, XRC
is not available.

> 
> > I don't know iWarp enough to tell if it is possible to find out in
> > runtime if rdma_cm is mandatory or other means of connection
> > establishment can be used, but if there is no way to do it, then new
> > parameter "hca_type" could be added to ini file with two possible
> > values "ib" and "iwarp".
> 
> Yes, there is a flag on either the device or port (I forget which)  
> which indicates whether it's an iwarp or IB device.  I think (at  
> least for today) we can assume that all iWARP devices require RDMA CM  
> -- right, iWARP guys?
Great! Then I don't see the need to add parameters to the INI file.

> 
> So do you want the arbitration rules for which CPC to be used to be  
> hard-coded in the openib component (possibly overridden by MCA  
> parameter to force a specific selection)?
> 
Not hard-coded, but controlled by the regular MCA mechanism, with the default
behaviour dependent on the HCA type. This is nothing new; we already do the
same for SRQ.

--
Gleb.


Re: [OMPI devel] RFC: Add "connect" field to openib BTL INI file

2007-10-25 Thread Gleb Natapov
On Wed, Oct 24, 2007 at 08:01:44PM -0400, Jeff Squyres wrote:
> My proposal is that the "connect" field can be added to the INI file  
> and take a comma-delimited list of values of acceptable CPCs for a  
> given device.  For example, the ConnectX HCA can take the following  
> value:
> 
>  connect = xrc, rdma_cm, oob
> 
> Meaning:
> 
> - first, try the XRC CPC to try to make the connection
> - if that fails, try the RDMA CM CPC
> - if that fails, try the OOB CPC
> - if that fails, fail the connection
> 
I don't think XRC should be used by default even if the HW supports it. XRC
should be attempted only if a special config option is set. And XRC
availability can be tested at runtime without additional options in the INI
file.
I don't know iWARP well enough to tell whether it is possible to find out at
runtime if rdma_cm is mandatory or whether other means of connection
establishment can be used, but if there is no way to do it, then a new
parameter "hca_type" could be added to the INI file with two possible values,
"ib" and "iwarp".


> iWARP-based NICs can use the following value:
> 
>  connect = rdma_cm
> 
> If no "connect" value is specified, then default value of "oob,  
> rdma_cm" can be assumed (possibly someday changing to "rdma_cm, oob").
> 
> I mention this here on the devel list because disparate groups are  
> working on different CPC's; coordination will be required to  
> implement this arbitration mechanism.
> 
> Comments?
> 

--
Gleb.


Re: [OMPI devel] collective problems

2007-10-23 Thread Gleb Natapov
On Tue, Oct 23, 2007 at 09:40:45AM -0400, Shipman, Galen M. wrote:
> So this problem goes WAY back..
> 
> The problem here is that the PML marks MPI completion just prior to calling
> btl_send and then returns to the user. This wouldn't be a problem if the BTL
> then did something, but in the case of OpenIB this fragment may not actually
> be on the wire (the joys of user level flow control).
> 
> One solution that we proposed was to allow btl_send to return either
> OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the PML to
> not mark MPI completion of the fragment and then MPI_WAITALL and others will
> do there job properly.
I even implemented this once, but there is a problem. Currently we mark the
request as completed on the MPI level and then do btl_send(). Whenever the IB
completion happens, the request is marked as complete on the PML level and
freed. The fix requires changing the order like this: call btl_send(), check
the return value from the BTL, and mark the request complete as necessary. The
problem is that, because we allow the BTL to call opal_progress() internally,
the request may already be completed on the MPI and PML levels and freed
before the call to btl_send() returns.

I did a code review to see how hard it would be to get rid of recursion in
Open MPI and I think this is doable. We have to disallow calling progress()
(or other functions that may call progress() internally) from the BTL and from
ULP callbacks that are called by the BTL. There are not many places that break
this rule. The main offenders are calls to FREE_LIST_WAIT(), but those never
actually call progress if the list can grow without limit, and this is the
most common use of FREE_LIST_WAIT(), so they may safely be changed to
FREE_LIST_GET(). Once the recursion problem is solved, the fix for this issue
will be a couple of lines of code.
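
To illustrate the kind of change I mean, a hypothetical helper (the free-list
macro names follow the tree, but treat the exact signatures and return codes
here as assumptions from memory):

static inline void *alloc_frag_without_progress(ompi_free_list_t *list)
{
    ompi_free_list_item_t *item;
    int rc;

    /* OMPI_FREE_LIST_WAIT() may call opal_progress() and recurse into a BTL;
     * OMPI_FREE_LIST_GET() never progresses and simply returns no item when
     * the list is exhausted */
    OMPI_FREE_LIST_GET(list, item, rc);
    return (OMPI_SUCCESS == rc) ? (void *)item : NULL;
}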

> 
> - Galen 
> 
> 
> 
> On 10/11/07 11:26 AM, "Gleb Natapov"  wrote:
> 
> > On Fri, Oct 05, 2007 at 09:43:44AM +0200, Jeff Squyres wrote:
> >> David --
> >> 
> >> Gleb and I just actively re-looked at this problem yesterday; we
> >> think it's related to https://svn.open-mpi.org/trac/ompi/ticket/
> >> 1015.  We previously thought this ticket was a different problem, but
> >> our analysis yesterday shows that it could be a real problem in the
> >> openib BTL or ob1 PML (kinda think it's the openib btl because it
> >> doesn't seem to happen on other networks, but who knows...).
> >> 
> >> Gleb is investigating.
> > Here is the result of the investigation. The problem is different than
> > #1015 ticket. What we have here is one rank calls isend() of a small
> > message and wait_all() in a loop and another one calls irecv(). The
> > problem is that isend() usually doesn't call opal_progress() anywhere
> > and wait_all() doesn't call progress if all requests are already completed
> > so messages are never progressed. We may force opal_progress() to be called
> > by setting btl_openib_free_list_max to 1000. Then wait_all() will call
> > progress because not every request will be immediately completed by OB1. Or
> > we can limit a number of uncompleted requests that OB1 can allocate by 
> > setting
> > pml_ob1_free_list_max to 1000. Then opal_progress() will be called from a
> > free_list_wait() when max will be reached. The second option works much
> > faster for me.
> > 
> >> 
> >> 
> >> 
> >> On Oct 5, 2007, at 12:59 AM, David Daniel wrote:
> >> 
> >>> Hi Folks,
> >>> 
> >>> I have been seeing some nasty behaviour in collectives,
> >>> particularly bcast and reduce.  Attached is a reproducer (for bcast).
> >>> 
> >>> The code will rapidly slow to a crawl (usually interpreted as a
> >>> hang in real applications) and sometimes gets killed with sigbus or
> >>> sigterm.
> >>> 
> >>> I see this with
> >>> 
> >>>   openmpi-1.2.3 or openmpi-1.2.4
> >>>   ofed 1.2
> >>>   linux 2.6.19 + patches
> >>>   gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
> >>>   4 socket, dual core opterons
> >>> 
> >>> run as
> >>> 
> >>>   mpirun --mca btl self,openib --npernode 1 --np 4 bcast-hang
> >>> 
> >>> To my now uneducated eye it looks as if the root process is rushing
> >>> ahead and not progressing earlier bcasts.
> >>> 
> >>> Anyone else seeing similar?  Any ideas for workarounds?
> >>> 
> >>> As a point of reference, mvapich2 0.9.8 works fine.
> >>> 
> >>> Thanks, David
> 

Re: [OMPI devel] putting common request completion waiting code into separate inline function

2007-10-18 Thread Gleb Natapov
On Wed, Oct 17, 2007 at 05:32:47PM -0400, Jeff Squyres wrote:
> Gleb -
> 
> I am not overly familiar with all these portions of the pml code  
> base, but it looks like not all of these places have exactly the same  
> code: the inline version is much shorter than some of the original  
> pml codes that it replaced.  Is the logic equivalent?
> 
My claim is that the logic is equivalent :) But I am asking here for
others to comment.

In most places the logic is like this:
if (req is not completed) {
if (more than one thread) {
   acquire lock
   wait for completion
   release lock
} else {
   wait for completion
}
}

the inline function does:
if (req is not completed) {
if (more than one thread) {
   acquire lock
}
wait for completion
if (more than one thread) {
   release lock
}
}

And in a non-threaded build both "if (more than one thread) {}" statements are
removed by the preprocessor.
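
For reference, the helper I propose looks roughly like this (shape inferred
from the pattern it replaces; the patch in my original mail is authoritative):

static inline void ompi_request_wait_completion(ompi_request_t *req)
{
    if (false == req->req_complete) {
        if (opal_using_threads()) {
            opal_mutex_lock(&ompi_request_lock);
        }
        ompi_request_waiting++;
        while (false == req->req_complete) {
            opal_condition_wait(&ompi_request_cond, &ompi_request_lock);
        }
        ompi_request_waiting--;
        if (opal_using_threads()) {
            opal_mutex_unlock(&ompi_request_lock);
        }
    }
}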

> Also, a minor nit -- it would be nice if the new inline function  
> conformed to our coding standards (constants on the left of ==, {}  
> around all blocks, etc.).  :-)
OK. The new code mimics the code it replaces :)

> 
> 
> On Oct 15, 2007, at 10:27 AM, Gleb Natapov wrote:
> 
> > Hi,
> >
> >Each time a someone needs to wait for request completion he
> > implements the same piece of code. Why not put this code into
> > inline function and use it instead. Look at the included patch, it
> > moves the common code into ompi_request_wait_completion() function.
> > Does somebody have any objection against committing it to the trunk?
> >
> > diff --git a/ompi/mca/crcp/coord/crcp_coord_pml.c b/ompi/mca/crcp/ 
> > coord/crcp_coord_pml.c
> > index b2392e4..eb9b9c1 100644
> > --- a/ompi/mca/crcp/coord/crcp_coord_pml.c
> > +++ b/ompi/mca/crcp/coord/crcp_coord_pml.c
> > @@ -3857,13 +3857,7 @@ static int coord_request_wait_all( size_t  
> > count,
> >  static int coord_request_wait( ompi_request_t * req,
> > ompi_status_public_t * status)
> >  {
> > -OPAL_THREAD_LOCK(&ompi_request_lock);
> > -ompi_request_waiting++;
> > -while (req->req_complete == false) {
> > -opal_condition_wait(&ompi_request_cond, &ompi_request_lock);
> > -}
> > -ompi_request_waiting--;
> > -OPAL_THREAD_UNLOCK(&ompi_request_lock);
> > +ompi_request_wait_completion(req);
> >
> >  if( MPI_STATUS_IGNORE != status ) {
> >  status->MPI_TAG= req->req_status.MPI_TAG;
> > diff --git a/ompi/mca/pml/cm/pml_cm_recv.c b/ompi/mca/pml/cm/ 
> > pml_cm_recv.c
> > index 0e23c9a..00efffc 100644
> > --- a/ompi/mca/pml/cm/pml_cm_recv.c
> > +++ b/ompi/mca/pml/cm/pml_cm_recv.c
> > @@ -112,22 +112,7 @@ mca_pml_cm_recv(void *addr,
> >  return ret;
> >  }
> >
> > -if (recvreq->req_base.req_ompi.req_complete == false) {
> > -/* give up and sleep until completion */
> > -if (opal_using_threads()) {
> > -opal_mutex_lock(&ompi_request_lock);
> > -ompi_request_waiting++;
> > -while (recvreq->req_base.req_ompi.req_complete == false)
> > -opal_condition_wait(&ompi_request_cond,  
> > &ompi_request_lock);
> > -ompi_request_waiting--;
> > -opal_mutex_unlock(&ompi_request_lock);
> > -} else {
> > -ompi_request_waiting++;
> > -while (recvreq->req_base.req_ompi.req_complete == false)
> > -opal_condition_wait(&ompi_request_cond,  
> > &ompi_request_lock);
> > -ompi_request_waiting--;
> > -}
> > -}
> > +ompi_request_wait_completion(&recvreq->req_base.req_ompi);
> >
> >  if (NULL != status) {  /* return status */
> >  *status = recvreq->req_base.req_ompi.req_status;
> > diff --git a/ompi/mca/pml/cm/pml_cm_send.c b/ompi/mca/pml/cm/ 
> > pml_cm_send.c
> > index ed9b189..f7d2e8c 100644
> > --- a/ompi/mca/pml/cm/pml_cm_send.c
> > +++ b/ompi/mca/pml/cm/pml_cm_send.c
> > @@ -175,23 +175,8 @@ mca_pml_cm_send(void *buf,
> >  MCA_PML_CM_THIN_SEND_REQUEST_RETURN(sendreq);
> >  return ret;
> >  }
> > -
> > -if (sendreq->req_send.req_base.req_ompi.req_complete  
> > == false) {
> > -/* give up and sleep until completion */
> > -if (opal_using_threads()) {
> > -opal_mutex_lock(&ompi_request_lock);
> > -  

[OMPI devel] putting common request completion waiting code into separate inline function

2007-10-15 Thread Gleb Natapov
Hi,

   Each time someone needs to wait for request completion, the same piece of
code gets reimplemented. Why not put this code into an inline function and use
it everywhere instead? Look at the included patch: it moves the common code
into an ompi_request_wait_completion() function. Does anybody have any
objection to committing it to the trunk?

diff --git a/ompi/mca/crcp/coord/crcp_coord_pml.c 
b/ompi/mca/crcp/coord/crcp_coord_pml.c
index b2392e4..eb9b9c1 100644
--- a/ompi/mca/crcp/coord/crcp_coord_pml.c
+++ b/ompi/mca/crcp/coord/crcp_coord_pml.c
@@ -3857,13 +3857,7 @@ static int coord_request_wait_all( size_t count,
 static int coord_request_wait( ompi_request_t * req,
ompi_status_public_t * status)
 {
-OPAL_THREAD_LOCK(&ompi_request_lock);
-ompi_request_waiting++;
-while (req->req_complete == false) {
-opal_condition_wait(&ompi_request_cond, &ompi_request_lock);
-}
-ompi_request_waiting--;
-OPAL_THREAD_UNLOCK(&ompi_request_lock);
+ompi_request_wait_completion(req);

 if( MPI_STATUS_IGNORE != status ) {
 status->MPI_TAG= req->req_status.MPI_TAG;
diff --git a/ompi/mca/pml/cm/pml_cm_recv.c b/ompi/mca/pml/cm/pml_cm_recv.c
index 0e23c9a..00efffc 100644
--- a/ompi/mca/pml/cm/pml_cm_recv.c
+++ b/ompi/mca/pml/cm/pml_cm_recv.c
@@ -112,22 +112,7 @@ mca_pml_cm_recv(void *addr,
 return ret;
 }

-if (recvreq->req_base.req_ompi.req_complete == false) {
-/* give up and sleep until completion */
-if (opal_using_threads()) {
-opal_mutex_lock(&ompi_request_lock);
-ompi_request_waiting++;
-while (recvreq->req_base.req_ompi.req_complete == false)
-opal_condition_wait(&ompi_request_cond, &ompi_request_lock);
-ompi_request_waiting--;
-opal_mutex_unlock(&ompi_request_lock);
-} else {
-ompi_request_waiting++;
-while (recvreq->req_base.req_ompi.req_complete == false)
-opal_condition_wait(&ompi_request_cond, &ompi_request_lock);
-ompi_request_waiting--;
-}
-}
+ompi_request_wait_completion(&recvreq->req_base.req_ompi);

 if (NULL != status) {  /* return status */
 *status = recvreq->req_base.req_ompi.req_status;
diff --git a/ompi/mca/pml/cm/pml_cm_send.c b/ompi/mca/pml/cm/pml_cm_send.c
index ed9b189..f7d2e8c 100644
--- a/ompi/mca/pml/cm/pml_cm_send.c
+++ b/ompi/mca/pml/cm/pml_cm_send.c
@@ -175,23 +175,8 @@ mca_pml_cm_send(void *buf,
 MCA_PML_CM_THIN_SEND_REQUEST_RETURN(sendreq);
 return ret;
 }
-
-if (sendreq->req_send.req_base.req_ompi.req_complete == false) {
-/* give up and sleep until completion */
-if (opal_using_threads()) {
-opal_mutex_lock(&ompi_request_lock);
-ompi_request_waiting++;
-while (sendreq->req_send.req_base.req_ompi.req_complete == 
false)
-opal_condition_wait(&ompi_request_cond, 
&ompi_request_lock);
-ompi_request_waiting--;
-opal_mutex_unlock(&ompi_request_lock);
-} else {
-ompi_request_waiting++;
-while (sendreq->req_send.req_base.req_ompi.req_complete == 
false)
-opal_condition_wait(&ompi_request_cond, 
&ompi_request_lock);
-ompi_request_waiting--;
-}
-}
+   
+ompi_request_wait_completion(&sendreq->req_send.req_base.req_ompi);

 ompi_request_free( (ompi_request_t**)&sendreq );
 } else {
diff --git a/ompi/mca/pml/dr/pml_dr_iprobe.c b/ompi/mca/pml/dr/pml_dr_iprobe.c
index 9149174..2063c54 100644
--- a/ompi/mca/pml/dr/pml_dr_iprobe.c
+++ b/ompi/mca/pml/dr/pml_dr_iprobe.c
@@ -64,22 +64,7 @@ int mca_pml_dr_probe(int src,
 MCA_PML_DR_RECV_REQUEST_INIT(&recvreq, NULL, 0, &ompi_mpi_char, src, tag, 
comm, true);
 MCA_PML_DR_RECV_REQUEST_START(&recvreq);

-if (recvreq.req_recv.req_base.req_ompi.req_complete == false) {
-/* give up and sleep until completion */
-if (opal_using_threads()) {
-opal_mutex_lock(&ompi_request_lock);
-ompi_request_waiting++;
-while (recvreq.req_recv.req_base.req_ompi.req_complete == false)
-opal_condition_wait(&ompi_request_cond, &ompi_request_lock);
-ompi_request_waiting--;
-opal_mutex_unlock(&ompi_request_lock);
-} else {
-ompi_request_waiting++;
-while (recvreq.req_recv.req_base.req_ompi.req_complete == false)
-opal_condition_wait(&ompi_request_cond, &ompi_request_lock);
-ompi_request_waiting--;
-}
-}
+ompi_request_wait_completion(&recvreq.req_recv.req_base.req_ompi);

 if (NULL != status) {
 *status = recvreq.req_recv.req_base.req_ompi.req_status;
@@ 

Re: [OMPI devel] collective problems

2007-10-11 Thread Gleb Natapov
On Fri, Oct 05, 2007 at 09:43:44AM +0200, Jeff Squyres wrote:
> David --
> 
> Gleb and I just actively re-looked at this problem yesterday; we  
> think it's related to https://svn.open-mpi.org/trac/ompi/ticket/ 
> 1015.  We previously thought this ticket was a different problem, but  
> our analysis yesterday shows that it could be a real problem in the  
> openib BTL or ob1 PML (kinda think it's the openib btl because it  
> doesn't seem to happen on other networks, but who knows...).
> 
> Gleb is investigating.
Here is the result of the investigation. The problem is different from the
one in ticket #1015. What we have here is one rank calling isend() of a small
message and wait_all() in a loop while the other one calls irecv(). The
problem is that isend() usually doesn't call opal_progress() anywhere, and
wait_all() doesn't call progress if all requests are already completed, so
messages are never progressed. We may force opal_progress() to be called by
setting btl_openib_free_list_max to 1000. Then wait_all() will call progress
because not every request will be immediately completed by OB1. Or we can
limit the number of uncompleted requests that OB1 can allocate by setting
pml_ob1_free_list_max to 1000. Then opal_progress() will be called from
free_list_wait() when the max is reached. The second option works much faster
for me.
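
As a concrete workaround for the reproducer, something along these lines
should force progress (the value 1000 is just the one I used; adjust as
needed):

  mpirun --mca btl self,openib --mca pml_ob1_free_list_max 1000 \
         --npernode 1 --np 4 bcast-hang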

> 
> 
> 
> On Oct 5, 2007, at 12:59 AM, David Daniel wrote:
> 
> > Hi Folks,
> >
> > I have been seeing some nasty behaviour in collectives,  
> > particularly bcast and reduce.  Attached is a reproducer (for bcast).
> >
> > The code will rapidly slow to a crawl (usually interpreted as a  
> > hang in real applications) and sometimes gets killed with sigbus or  
> > sigterm.
> >
> > I see this with
> >
> >   openmpi-1.2.3 or openmpi-1.2.4
> >   ofed 1.2
> >   linux 2.6.19 + patches
> >   gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
> >   4 socket, dual core opterons
> >
> > run as
> >
> >   mpirun --mca btl self,openib --npernode 1 --np 4 bcast-hang
> >
> > To my now uneducated eye it looks as if the root process is rushing  
> > ahead and not progressing earlier bcasts.
> >
> > Anyone else seeing similar?  Any ideas for workarounds?
> >
> > As a point of reference, mvapich2 0.9.8 works fine.
> >
> > Thanks, David
> >
> >
> > 
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> Cisco Systems
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


Re: [OMPI devel] osu_bibw failing for message sizes 2097152 and larger

2007-09-19 Thread Gleb Natapov
On Wed, Sep 19, 2007 at 10:26:15AM -0400, Dan Lacher wrote:
> In doing some runs with the osu_bibw test on a single node, we have 
> found that it hands when using the trunk for message sizes 2097152 or 
> larger unless the mpool_sm_min_size is set to a number larger than the 
> message size.  We are not seeing this issue in the 1.2 branch.  Just 
> checking to see if I am missed something or if I should be filing a 
> defect to trac this issue.
> 
I can't reproduce this.  "mpirun -np 2 ./osu_bibw" works for me.

--
Gleb.


Re: [OMPI devel] Commit r16105

2007-09-18 Thread Gleb Natapov
On Tue, Sep 18, 2007 at 10:57:38AM -0400, George Bosilca wrote:
> More information about this can be found in Trac ticket #1127
> (https://svn.open-mpi.org/trac/ompi/ticket/1127).
>
OK. So the code I cited is only a temporary solution. Thanks.

>   george.
>
> On Sep 18, 2007, at 10:20 AM, Gleb Natapov wrote:
>
>> On Tue, Sep 18, 2007 at 09:44:42AM -0400, George Bosilca wrote:
>>> The setup of a communicator includes, as a last stage, a collective
>>> communication. As a result, some of the nodes can exit the collective
>>> before the others and therefore can start sending messages using this
>>> communicator [while some of the other nodes are still waiting for the
>>> collective completion]. This will lead to a situation where a node
>>> receives a message for a communicator that it is still building up.
>>>
>>> There is a bug filed in Trac about this. In FT-MPI we temporarily put
>>> these messages in an internal queue, and deliver them to the right
>>> communicator only once this communicator is completely created.
>> In the ompi_comm_nextcid() function there is this code for the
>> thread_multiple case:
>>
>>  /* for synchronization purposes, avoids receiving fragments for
>> a communicator id, which might not yet been known. For single-threaded
>> scenarios, this call is in ompi_comm_activate, for multi-threaded
>> scenarios, it has to be already here ( before releasing another
>> thread into the cid-allocation loop ) */
>>  (allredfnct)(&response, &glresponse, 1, MPI_MIN, comm, bridgecomm,
>>  local_leader, remote_leader, send_first );
>>
>> This collective is executed on the old communicator after setup of the new
>> cid. Is this not enough to solve the problem? Some ranks may leave
>> this collective call earlier than others, but none can leave it before
>> all ranks have entered it, and at that stage the new communicator already
>> exists in all of them. Am I missing something?
>>
>>
>>>
>>>   george.
>>>
>>> On Sep 18, 2007, at 9:06 AM, Gleb Natapov wrote:
>>>
>>>> George,
>>>>
>>>> In the comment you are saying that "a message for a not yet existing
>>>> communicator can happen". Can you explain in what situation it can
>>>> happen?
>>>>
>>>> Thanks,
>>>>
>>>> --
>>>>Gleb.
>>>> ___
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>
>>
>>
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>> --
>>  Gleb.
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


Re: [OMPI devel] Commit r16105

2007-09-18 Thread Gleb Natapov
On Tue, Sep 18, 2007 at 09:44:42AM -0400, George Bosilca wrote:
> The setup of a communicator includes, as a last stage, a collective
> communication. As a result, some of the nodes can exit the collective
> before the others and therefore can start sending messages using this
> communicator [while some of the other nodes are still waiting for the
> collective completion]. This will lead to a situation where a node receives
> a message for a communicator that it is still building up.
>
> There is a bug filed in Trac about this. In FT-MPI we temporarily put these
> messages in an internal queue, and deliver them to the right communicator
> only once this communicator is completely created.
In the ompi_comm_nextcid() function there is this code for the
thread_multiple case:

 /* for synchronization purposes, avoids receiving fragments for 
a communicator id, which might not yet been known. For single-threaded
scenarios, this call is in ompi_comm_activate, for multi-threaded
scenarios, it has to be already here ( before releasing another
thread into the cid-allocation loop ) */
 (allredfnct)(&response, &glresponse, 1, MPI_MIN, comm, bridgecomm,
 local_leader, remote_leader, send_first );

This collective is executed on the old communicator after setup of the new
cid. Is this not enough to solve the problem? Some ranks may leave
this collective call earlier than others, but none can leave it before
all ranks have entered it, and at that stage the new communicator already
exists in all of them. Am I missing something?
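
For what it's worth, my understanding of the queuing approach you describe
is roughly the following sketch (hypothetical code and names, not the actual
FT-MPI or Open MPI implementation, error handling omitted): fragments that
arrive for a cid that is not yet known are parked on a list and replayed
once the communicator is completely created.

#include <stdlib.h>

struct pending_frag {
    int cid;                       /* cid the fragment was addressed to */
    void *frag;                    /* opaque fragment payload           */
    struct pending_frag *next;
};

static struct pending_frag *pending_head = NULL;

/* Receive path: called when lookup of 'cid' finds no communicator yet. */
static void defer_frag(int cid, void *frag)
{
    struct pending_frag *p = malloc(sizeof(*p));
    p->cid  = cid;
    p->frag = frag;
    p->next = pending_head;
    pending_head = p;
}

/* Called once the communicator for 'cid' is completely created: deliver
 * (and unlink) every fragment that was waiting for it. */
static void replay_frags(int cid, void (*deliver)(void *frag))
{
    struct pending_frag **pp = &pending_head;

    while (NULL != *pp) {
        if ((*pp)->cid == cid) {
            struct pending_frag *p = *pp;
            *pp = p->next;
            deliver(p->frag);
            free(p);
        } else {
            pp = &(*pp)->next;
        }
    }
}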


>
>   george.
>
> On Sep 18, 2007, at 9:06 AM, Gleb Natapov wrote:
>
>> George,
>>
>> In the comment you are saying that "a message for a not yet existing
>> communicator can happen". Can you explain in what situation it can
>> happen?
>>
>> Thanks,
>>
>> --
>>  Gleb.
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


[OMPI devel] Commit r16105

2007-09-18 Thread Gleb Natapov
George,

In the comment you are saying that "a message for a not yet existing
communicator can happen". Can you explain in what situation it can
happen?

Thanks,

--
Gleb.

