Re: [OMPI devel] RTE issue I. Support for non-MPI jobs

2007-12-05 Thread Ralph H Castain



On 12/5/07 7:58 AM, "rolf.vandeva...@sun.com" wrote:

> Ralph H Castain wrote:
> 
>> I. Support for non-MPI jobs
>> Considerable complexity currently exists in ORTE because of the stipulation
>> in our first requirements document that users be able to mpirun non-MPI jobs
>> - i.e., that we support such calls as "mpirun -n 100 hostname". This creates
>> a situation, however, where the RTE cannot know if the application will call
>> MPI_Init (or at least orte_init), which has significant implications to the
>> RTE's architecture. For example, during the launch of the application's
>> processes, the RTE cannot go into any form of blocking receive while waiting
>> for the procs to report a successful startup as this won't occur for
>> execution of something like "hostname".
>> 
>> Jeff has noted that support for non-MPI jobs is not something most (all?)
>> MPIs currently provide, nor something that users are likely to exploit as
>> they can more easily just "qsub hostname" (or the equivalent for that
>> environment). While nice for debugging purposes, therefore, it isn't clear
>> that supporting non-MPI jobs is worth the increased code complexity and
>> fragility.
>> 
>> In addition, the fact that we do not know if a job will call Init limits our
>> ability to do collective communications within the RTE, and hence our
>> scalability - see the note on that specific subject for further discussion
>> on this area.
>> 
>> This would be a "regression" in behavior, though, so the questions for the
>> community are:
>> 
>> (a) do we want to retain the feature to run non-MPI jobs with mpirun as-is
>> (and accept the tradeoffs, including the one described below in II)?
>>  
>> 
> Hi Ralph:
> From a user standpoint, (a) would be preferable.  However, as you point
> out, there are issues.  Are you saying that we cannot do collectives
> (Item III) if we preserve (a)?  Or is it that things will just be more
> complex?  I guess I am looking for more details about what the tradeoffs
> are for preserving (a).

I believe it would be more accurate to say things would be more complex. I'm
not sure we know enough at the moment to say collectives can't be done at
all. All I can say is that I have spent a little time trying to define a
"snowball" collective (i.e., one that collects info from each daemon it
passes through to deliver the final collection to the HNP) and not knowing
whether or not a process is going to call orte_init is one of the hurdles. I
believe some clever programming could probably overcome it - at least, I'm
not willing to give up yet. It just will take additional effort and time.
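
To make the idea concrete, here is a minimal, self-contained sketch of the
"snowball" pattern - plain C with made-up data structures, not the ORTE API:
each daemon on the route appends its own contribution to the buffer it
received and forwards the growing collection toward the HNP.

/* Sketch of a "snowball" gather: each daemon appends its payload to the
 * buffer it received from the previous hop and passes the growing buffer
 * along until it reaches the HNP (vpid 0 here).  Hypothetical payload -
 * not the ORTE API. */
#include <stdio.h>
#include <stdlib.h>

#define NUM_DAEMONS 4

typedef struct {
    int vpid;        /* which daemon contributed this entry */
    int num_local;   /* e.g., number of local procs it launched */
} contribution_t;

int main(void)
{
    contribution_t *snowball = NULL;   /* the growing collection */
    size_t count = 0;

    /* Walk the route from the farthest daemon toward the HNP (vpid 0).
     * At each hop the daemon tacks its own contribution onto the buffer. */
    for (int vpid = NUM_DAEMONS - 1; vpid >= 0; vpid--) {
        snowball = realloc(snowball, (count + 1) * sizeof(contribution_t));
        snowball[count].vpid = vpid;
        snowball[count].num_local = vpid + 1;   /* made-up payload */
        count++;
        /* This is where the real daemon would relay the buffer to the next
         * hop - and where not knowing whether its local procs will ever
         * call orte_init becomes one of the hurdles. */
    }

    /* The HNP now holds one entry per daemon. */
    for (size_t i = 0; i < count; i++) {
        printf("daemon %d reported %d local procs\n",
               snowball[i].vpid, snowball[i].num_local);
    }
    free(snowball);
    return 0;
}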

Likewise for the overall logic in the system. The biggest problem with
supporting non-MPI apps is that you have to be -very- careful throughout the
RTE to avoid any blocking operations that depend upon the procs calling
orte_init or orte_finalize - while still requiring that if they -did- call
orte_init, then they must call orte_finalize or else we consider the
termination to have been an 'abort'. Not impossible - but it adds time spent
while making changes to the system, and always leaves open the door for a
deadlock condition to jump out of the bushes.
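
To make that bookkeeping concrete, here is a toy sketch of the termination
rule described above - hypothetical flags, not the actual ORTE state machine:
a proc that called orte_init must also call orte_finalize or its exit is
treated as an abort, while a proc that never called orte_init (the "mpirun
hostname" case) can only be judged by its exit code.

/* Toy model of the termination rule - hypothetical state, not ORTE code. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool called_init;      /* did the proc call orte_init? */
    bool called_finalize;  /* did it call orte_finalize before exiting? */
    int  exit_code;        /* status reported by waitpid() */
} proc_state_t;

/* Return true if the proc's termination should be treated as an abort. */
static bool terminated_abnormally(const proc_state_t *p)
{
    if (p->called_init && !p->called_finalize) {
        return true;   /* an ORTE proc that skipped finalize is an abort */
    }
    /* Otherwise - including procs that never called orte_init - all we
     * have to go on is the exit code. */
    return 0 != p->exit_code;
}

int main(void)
{
    proc_state_t mpi_ok   = { true,  true,  0 };  /* normal MPI proc */
    proc_state_t mpi_bad  = { true,  false, 0 };  /* died between init and finalize */
    proc_state_t hostname = { false, false, 0 };  /* non-MPI "hostname" run */

    printf("%d %d %d\n", terminated_abnormally(&mpi_ok),
           terminated_abnormally(&mpi_bad),
           terminated_abnormally(&hostname));
    return 0;
}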

It can be done - we obviously do it currently - though I note it has taken
us nearly three years to identify and resolve all those deadlock scenarios
(fingers crossed). I'm just pointing out that it does introduce some
complexity that could potentially be removed from the current code base, and
will make those inbound collectives more difficult. So Jeff and I thought it
would be worth at least asking whether this was a desirable feature we should
preserve, or something people really didn't care about or use and that was
just another of those leftover requirements from the early days.

Personally, I like the feature while debugging the system as it allows me to
test the underlying allocate/map/launch infrastructure without the OOB
wireup - but I can live without that if people would prefer that we further
simplify the code. I can also certainly use the switch to indicate "this is
a non-MPI app" when I'm debugging, but I consider that to be somewhat user
unfriendly...

...especially since if the user forgets the switch and mpirun's a non-MPI
job, we would have no choice but to "hang" until they ctrl-c the job, or
introduce some totally artificial timeout constraint!

So I guess my recommendation is: if you believe (a) is preferable from a
user's perspective, then I would preserve the feature "as-is" and accept the
code complexity and risk since that can be overcome with careful design and
testing until such time as we -prove- that inbound collectives cannot be
written under those conditions. I believe this last point is critical as we
really shouldn't accept linear scale-by-node as a limitation.

Now if I could just get some help on those inbound collectives so we can
resolve that point...but that was note III, I believe. ;-)



> 
> Having said that, we would probably be 

Re: [OMPI devel] RTE Issue IV: RTE/MPI relative modex responsibilities

2007-12-05 Thread Tim Prins

Well, I think it is pretty obvious that I am a fan of an attribute system :)

For completeness, I will point out that we also exchange architecture 
and hostname info in the modex.


Do we really need a complete node map? As far as I can tell, it looks 
like the MPI layer only needs a list of local processes. So maybe it 
would be better to forget about the node ids at the MPI layer and just 
return the local procs.


So my vote would be to leave the modex alone, but remove the node id, 
and add a function to get the list of local procs. It doesn't matter to 
me how the RTE implements that.
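
A hypothetical prototype for such a call might look like the following - the
names and the stub implementation are purely illustrative, not an existing
ORTE interface:

/* Illustrative sketch only - not an existing ORTE interface. */
#include <stdlib.h>

typedef struct { int jobid; int vpid; } example_proc_name_t;

/* Return the names of the procs that share a node with the caller.  The
 * RTE is free to implement this however it likes (daemon lookup,
 * environment, resource-manager query, ...).  Caller frees *procs. */
int example_rte_get_local_procs(example_proc_name_t **procs, size_t *nprocs)
{
    /* Stub: pretend only ourselves (vpid 0 of job 1) is local. */
    *procs = malloc(sizeof(example_proc_name_t));
    if (NULL == *procs) {
        return -1;
    }
    (*procs)[0].jobid = 1;
    (*procs)[0].vpid  = 0;
    *nprocs = 1;
    return 0;
}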


Alternatively, if we did a process attribute system we could just use 
predefined attributes, and the runtime can get each process's node id 
however it wants.


Tim

Ralph H Castain wrote:

IV. RTE/MPI relative modex responsibilities
The modex operation conducted during MPI_Init currently involves the
exchange of two critical pieces of information:

1. the location (i.e., node) of each process in my job so I can determine
who shares a node with me. This is subsequently used by the shared memory
subsystem for initialization and message routing; and

2. BTL contact info for each process in my job.

During our recent efforts to further abstract the RTE from the MPI layer, we
pushed responsibility for both pieces of information into the MPI layer.
This wasn't done capriciously - the modex has always included the exchange
of both pieces of information, and we chose not to disturb that situation.

However, the mixing of these two functional requirements does cause problems
when dealing with an environment such as the Cray where BTL information is
"exchanged" via an entirely different mechanism. In addition, it has been
noted that the RTE (and not the MPI layer) actually "knows" the node
location for each process.

Hence, questions have been raised as to whether:

(a) the modex should be built into a framework to allow multiple BTL
exchange mechanisms to be supported, or some alternative mechanism be used -
one suggestion made was to implement an MPICH-like attribute exchange; and

(b) the RTE should absorb responsibility for providing a "node map" of the
processes in a job (note: the modex may -use- that info, but would no longer
be required to exchange it). This has a number of implications that need to
be carefully considered - e.g., the memory required to store the node map in
every process is non-zero. On the other hand:

(i) every proc already -does- store the node for every proc - it is simply
stored in the ompi_proc_t structures as opposed to somewhere in the RTE. We
would want to avoid duplicating that storage, but there would be no change
in memory footprint if done carefully.

(ii) every daemon already knows the node map for the job, so communicating
that info to its local procs may not prove a major burden. However, the very
environments where this subject may be an issue (e.g., the Cray) do not use
our daemons, so some alternative mechanism for obtaining the info would be
required.


So the questions to be considered here are:

(a) do we leave the current modex "as-is", to include exchange of the node
map, perhaps including "#if" statements to support different exchange
mechanisms?

(b) do we separate the two functions currently in the modex and push the
requirement to obtain a node map into the RTE? If so, how do we want the MPI
layer to retrieve that info so we avoid increasing our memory footprint?

(c) do we create a separate modex framework for handling the different
exchange mechanisms for BTL info, incorporate it into an existing one (if
so, which one - perhaps the new publish-subscribe framework), implement an
alternative approach, or...?

(d) other suggestions?

Ralph






Re: [OMPI devel] RTE Issue IV: RTE/MPI relative modex responsibilities

2007-12-05 Thread Ralph H Castain



On 12/5/07 8:48 AM, "Tim Prins"  wrote:

> Well, I think it is pretty obvious that I am a fan of an attribute system :)
> 
> For completeness, I will point out that we also exchange architecture
> and hostname info in the modex.

True - except we should note that hostname info is only exchanged if someone
specifically requests it.

> 
> Do we really need a complete node map? As far as I can tell, it looks
> like the MPI layer only needs a list of local processes. So maybe it
> would be better to forget about the node ids at the MPI layer and just
> return the local procs.

I agree, though I don't think we want a parallel list of procs. We just need
to set the "local" flag in the existing ompi_proc_t structures.

One option is for the RTE to just pass in an environment variable with a
comma-separated list of your local ranks, but that creates a problem down
the road when trying to integrate more tightly with systems like SLURM,
where the procs would get mass-launched (so the environment has to be the
same for all of them).
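
For the environment-variable option, the parsing side is trivial - a hedged
sketch, with the variable name and the "local" array made up for illustration
(the real code would set a flag in the existing ompi_proc_t):

/* Parse a comma-separated list of local ranks from an environment variable
 * and mark the matching procs.  Variable name and data layout are
 * hypothetical. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NPROCS 8

int main(void)
{
    bool is_local[NPROCS] = { false };
    const char *val = getenv("EXAMPLE_LOCAL_RANKS");   /* e.g. "0,2,5" */

    if (NULL != val) {
        char *copy = strdup(val);
        for (char *tok = strtok(copy, ","); NULL != tok;
             tok = strtok(NULL, ",")) {
            int rank = atoi(tok);
            if (rank >= 0 && rank < NPROCS) {
                is_local[rank] = true;
            }
        }
        free(copy);
    }

    for (int i = 0; i < NPROCS; i++) {
        printf("rank %d local: %s\n", i, is_local[i] ? "yes" : "no");
    }
    return 0;
}

The catch, as noted above, is that mass-launched procs all see the same
environment, so the variable could not differ from node to node.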

> 
> So my vote would be to leave the modex alone, but remove the node id,
> and add a function to get the list of local procs. It doesn't matter to
> me how the RTE implements that.

I think we would need to be careful here that we don't create a need for
more communication. We have two functions currently in the modex:

1. how to exchange the info required to populate the ompi_proc_t structures;
and

2. how to identify which of those procs are "local"

The problem with leaving the modex as it currently sits is that some
environments require a different mechanism for exchanging the ompi_proc_t
info. While most can use the RML, some can't. The same division of
capabilities applies to getting the "local" info, so it makes sense to me to
put the modex in a framework.
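
Structurally, such a framework would follow the usual MCA pattern of a struct
of function pointers - a hedged sketch of what the module interface might
contain (names and signatures are hypothetical, not a committed design):

/* Hypothetical modex module interface - illustrative only. */
#include <stddef.h>

typedef struct example_modex_module_t {
    int (*init)(void);       /* component setup */
    int (*finalize)(void);   /* component teardown */

    /* publish my BTL contact info under a string key */
    int (*put)(const char *key, const void *data, size_t size);

    /* fetch another proc's published info */
    int (*get)(int jobid, int vpid, const char *key,
               void **data, size_t *size);

    /* "everyone has published" synchronization point */
    int (*exchange)(void);
} example_modex_module_t;

An RML-based component and a Cray-specific component would then simply
provide different function tables.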

Otherwise, we wind up with a bunch of #if's in the code to support
environments like the Cray. I believe the MCA system was put in place
precisely to avoid those kinds of practices, so it makes sense to me to take
advantage of it.


> 
> Alternatively, if we did a process attribute system we could just use
> predefined attributes, and the runtime can get each process's node id
> however it wants.

Same problem as above, isn't it? Probably ignorance on my part, but it seems
to me that we simply exchange a modex framework for an attribute framework
(since each environment would have to get the attribute values in a
different manner) - don't we?

I have no problem with using attributes instead of the modex, but the issue
appears to be the same either way - you still need a framework to handle the
different methods.


Ralph

> 
> Tim
> 
> Ralph H Castain wrote:
>> IV. RTE/MPI relative modex responsibilities
>> The modex operation conducted during MPI_Init currently involves the
>> exchange of two critical pieces of information:
>> 
>> 1. the location (i.e., node) of each process in my job so I can determine
>> who shares a node with me. This is subsequently used by the shared memory
>> subsystem for initialization and message routing; and
>> 
>> 2. BTL contact info for each process in my job.
>> 
>> During our recent efforts to further abstract the RTE from the MPI layer, we
>> pushed responsibility for both pieces of information into the MPI layer.
>> This wasn't done capriciously - the modex has always included the exchange
>> of both pieces of information, and we chose not to disturb that situation.
>> 
>> However, the mixing of these two functional requirements does cause problems
>> when dealing with an environment such as the Cray where BTL information is
>> "exchanged" via an entirely different mechanism. In addition, it has been
>> noted that the RTE (and not the MPI layer) actually "knows" the node
>> location for each process.
>> 
>> Hence, questions have been raised as to whether:
>> 
>> (a) the modex should be built into a framework to allow multiple BTL
>> exchange mechanisms to be supported, or some alternative mechanism be used -
>> one suggestion made was to implement an MPICH-like attribute exchange; and
>> 
>> (b) the RTE should absorb responsibility for providing a "node map" of the
>> processes in a job (note: the modex may -use- that info, but would no longer
>> be required to exchange it). This has a number of implications that need to
>> be carefully considered - e.g., the memory required to store the node map in
>> every process is non-zero. On the other hand:
>> 
>> (i) every proc already -does- store the node for every proc - it is simply
>> stored in the ompi_proc_t structures as opposed to somewhere in the RTE. We
>> would want to avoid duplicating that storage, but there would be no change
>> in memory footprint if done carefully.
>> 
>> (ii) every daemon already knows the node map for the job, so communicating
>> that info to its local procs may not prove a major burden. However, the very
>> environments where this subject may be an issue (e.g., the Cray) do not use
>> our daemons, so 

Re: [OMPI devel] RTE Issue II: Interaction between the ROUTED and GRPCOMM frameworks

2007-12-05 Thread Brian W. Barrett

To me, (a) is dumb and (c) isn't a non-starter.

The whole point of the component system is to separate concerns.  Routing 
topology and collectives operations are two different concerns.  While 
there's some overlap (a topology-aware collective doesn't make sense when 
using the unity routing structure), it's not overlap in the sense that one 
implies you need the other.  I can think of a couple of different ways of 
implementing the group communication framework, all of which are totally 
independent of the particulars of how routing is tracked.


(b) has a very reasonable track record of working well on the OMPI side 
(the mpool / btl thing figures itself out fairly well).  Bringing such a 
setup over to ORTE wouldn't be bad, but a bit hackish.


Of course, there are at most two routed components built at any time, and 
the defaults are all most non-debugging people will ever need, so I guess 
I'm not convinced (c) is a non-starter.


Brian

On Wed, 5 Dec 2007, Tim Prins wrote:


To me, (c) is a non-starter. I think whenever possible we should be
automatically doing the right thing. The user should not need to have
any idea how things work inside the library.

Between options (a) and (b), I don't really care.

(b) would be great if we had an MCA component dependency system, which has
been much talked about. But without such a system it gets messy.

(a) has the advantage of making sure there are no problems and allowing
the two systems to interact very nicely together, but it also might add a
large burden to a component writer.

On a related, but slightly different topic, one thing that has always
bothered me about the grpcomm/routed implementation is that it is not
self contained. There is logic for routing algorithms outside of the
components (for example, in orte/orted/orted_comm.c). So, if there are
any overhauls planned I definitely think this needs to be cleaned up.

Thanks,

Tim

Ralph H Castain wrote:

II. Interaction between the ROUTED and GRPCOMM frameworks
When we initially developed these two frameworks within the RTE, we
envisioned them to operate totally independently of each other. Thus, the
grpcomm collectives provide algorithms such as a binomial "xcast" that uses
the daemons to scalably send messages across the system.

However, we recently realized that the efficacy of the current grpcomm
algorithms directly hinges on the daemons being fully connected - which we
were recently told may not be the case as other people introduce different
ROUTED components. For example, using the binomial algorithm in grpcomm's
xcast while having a ring topology selected in ROUTED would likely result in
terrible performance.
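
For reference, the fan-out the binomial xcast assumes looks roughly like the
following - a self-contained sketch of the standard binomial-tree pattern,
not the grpcomm code itself. Every send it prints is assumed to be a direct
daemon-to-daemon connection, which is exactly the assumption a ring (or other
sparse) ROUTED component breaks.

/* Print the binomial-tree fan-out for a set of daemons rooted at vpid 0.
 * Standalone illustration - not the grpcomm implementation. */
#include <stdio.h>

#define NUM_DAEMONS 12

int main(void)
{
    for (int rank = 0; rank < NUM_DAEMONS; rank++) {
        /* Find the step at which this daemon receives the message: the
         * lowest set bit of its rank (the root "has" it from the start). */
        int mask = 1;
        while (mask < NUM_DAEMONS && 0 == (rank & mask)) {
            mask <<= 1;
        }

        /* After receiving, forward to peers at decreasing power-of-two
         * offsets; each of these sends is assumed to be direct. */
        printf("daemon %2d sends to:", rank);
        for (mask >>= 1; mask > 0; mask >>= 1) {
            if (rank + mask < NUM_DAEMONS) {
                printf(" %d", rank + mask);
            }
        }
        printf("\n");
    }
    return 0;
}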

This raises the following questions:

(a) should the GRPCOMM and ROUTED frameworks be consolidated to ensure that
the group collectives algorithms properly "match" the communication
topology?

(b) should we automatically select the grpcomm/routed pairings based on some
internal logic?

(c) should we leave this "as-is" and make the user responsible for making
intelligent choices (and for detecting when performance is bad due to
this mismatch)?

(d) other suggestions?

Ralph






Re: [OMPI devel] Using MTT to test the newly added SCTP BTL

2007-12-05 Thread Karol Mroz
Hi...

Karol Mroz wrote:

> Removal of .ompi_ignore should not create build problems for anyone who
> is running without some form of SCTP support. To test this claim, we
> built Open MPI with .ompi_ignore removed and no SCTP support on both an
> ubuntu linux and an OSX machine. Both builds succeeded without any problem.

In light of the above, are there any objections to us removing the
.ompi_ignore file from the SCTP BTL code?

I tried to work around this problem by using a pre-installed version of
Open MPI to run the MTT tests (the ibm tests initially), but all I get is a
short summary from MTT saying that things succeeded, instead of the detailed
list of specific test successes/failures shown when using a nightly
tarball. The 'tests' also complete much faster, which raises some concern
as to whether they were actually run. Furthermore, MTT puts the source
into a new 'random' directory prior to building (is there a way around
this?), so I can't add the SCTP directory by hand and then run the
build/installation phase. Adding the code on the fly during the
installation phase also does not work.

Any advice in this matter?

Thanks again everyone.


-- 
Karol Mroz
km...@cs.ubc.ca




[OMPI devel] 32-bit openib is broken on the trunk as of Nov 27th, r16799

2007-12-05 Thread Tim Mattox
Hello,
It appears that sometime after r16777, and by r16799, something broke
in the trunk's openib support for 32-bit builds.
The 64-bit tests all seem normal, as do the 32-bit & 64-bit tests on
the 1.2 branch on the same machine (odin).

See this MTT results page permalink showing the 32-bit odin runs:
http://www.open-mpi.org/mtt/index.php?do_redir=468

Pasha & Gleb, you both did a variety of checkins in that svn r# range.
Do either of you have time to investigate this?

Here is a snippet from one randomly picked failed test (out of thousands):
[1,1][btl_openib_component.c:1665:btl_openib_module_progress] from
odin001 to: odin001 error
polling LP CQ with status LOCAL PROTOCOL ERROR status number 4 for
wr_id 141733120 opcode 128
qp_idx 3
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 29761 on
node odin001 calling "abort". This will have caused other processes
in the application to be terminated by signals sent by mpirun
(as reported here).
--------------------------------------------------------------------------

Thanks, and happy bug hunting!
-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/


Re: [OMPI devel] RTE Issue II: Interaction between the ROUTED and GRPCOMM frameworks

2007-12-05 Thread Ralph H Castain
I'm not sure I would call (a) "dumb", but I would agree it isn't a desirable
option. ;-)

The issue isn't with the current two routed components. The issue arose
because additional routed components are about to be committed to the
system. None of those added components are fully connected - i.e., each
daemon only has sparse connections to its peers. Hence, the current grpcomm
collectives will cause performance issues.

Re-writing those collectives to be independent of sparse vs. fully connected
schemes is a non-trivial exercise. Do I hear a volunteer out there? ;-)

I could have just left this issue off the list, of course, and let the
authors of the new routed components figure out what was wrong and deal with
it. But I thought it would be more friendly to raise the point and see if
people had any suggestions on how to resolve the issue -before- it rears its
head.

So, having done so, perhaps the best solution is option (c) - and let anyone
who brings a new routing scheme into the system deal with the collective
mis-match problem.

As for the relaying operations in the orted noted by Tim: including the
relay operation in the grpcomm framework would be extremely difficult,
although I won't immediately rule it out as "impossible". The problem is
that the orted has to actually process the message - it doesn't just route
it on to some other destination like the RML does with a message. Thus, the
orted_comm code contains a "relay" function that receives the message,
processes it, and then sends it along according to whatever xmission
protocol was specified by the sender.

To move that code into grpcomm would require that the collectives put a flag
in the message indicating "intermediaries who are routing this message need
to process it first". Grpcomm would then have to include some mechanism for
me to indicate "if you are told to process a message first, then here is the
function you need to call". We would then have to add a mechanism in the RML
routing system that looks at the message for this flag and calls the
"process it" function before continuing the route.

I had considered the alternative of calling the routed component to get the
next recipient for the message (instead of computing it myself), which would
at least remove the computation of the next recipients from the orted. I
would think that would be a more feasible next step, though it would take
development of another routed component to support things like the binomial
xcast algorithm, and possibly a change to the routed framework API (since
algo's like binomial might have to return more than one "next recipient").
It also could get a little tricky as the routed component might have to
include logic to deal with some of the special use-cases currently handled
in grpcomm.
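
To illustrate the API change hinted at above - again a purely hypothetical
sketch - the routed framework would hand back a list of next recipients
rather than a single one, since a binomial relay can fan out to several
peers at once:

/* Hypothetical routed call returning a *list* of next recipients.  The
 * stub implements a simple chain (one next hop); a binomial component
 * would fill in several entries. */
#include <stdio.h>

#define MAX_NEXT 16
#define NUM_DAEMONS 8

static int example_routed_get_next(int me, int next[MAX_NEXT])
{
    if (me + 1 < NUM_DAEMONS) {
        next[0] = me + 1;   /* chain routing: relay to the next daemon */
        return 1;
    }
    return 0;               /* end of the chain: nothing left to relay */
}

int main(void)
{
    int next[MAX_NEXT];
    int n = example_routed_get_next(3, next);
    for (int i = 0; i < n; i++) {
        printf("daemon 3 relays to %d\n", next[i]);
    }
    return 0;
}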

All of this is non-trivial, which is why nobody tried to do it! If you want
to tackle that area of the code, we would welcome the volunteer - all I ask
is that you do it in a tmp branch somewhere first so we can test it.

Ralph

On 12/5/07 9:29 AM, "Brian W. Barrett"  wrote:

> To me, (a) is dumb and (c) isn't a non-starter.
> 
> The whole point of the component system is to separate concerns.  Routing
> topology and collectives operations are two different concerns.  While
> there's some overlap (a topology-aware collective doesn't make sense when
> using the unity routing structure), it's not overlap in the sense that one
> implies you need the other.  I can think of a couple of different ways of
> implementing the group communication framework, all of which are totally
> independent of the particulars of how routing is tracked.
> 
> (b) has a very reasonable track record of working well on the OMPI side
> (the mpool / btl thing figures itself out fairly well).  Bringing such a
> setup over to ORTE wouldn't be bad, but a bit hackish.
> 
> Of course, there are at most two routed components built at any time, and
> the defaults are all most non-debugging people will ever need, so I guess
> I'm not convinced (c) is a non-starter.
> 
> Brian
> 
> On Wed, 5 Dec 2007, Tim Prins wrote:
> 
>> To me, (c) is a non-starter. I think whenever possible we should be
>> automatically doing the right thing. The user should not need to have
>> any idea how things work inside the library.
>> 
>> Between options (a) and (b), I don't really care.
>> 
>> (b) would be great if we had an MCA component dependency system, which has
>> been much talked about. But without such a system it gets messy.
>> 
>> (a) has the advantage of making sure there are no problems and allowing
>> the two systems to interact very nicely together, but it also might add a
>> large burden to a component writer.
>> 
>> On a related, but slightly different topic, one thing that has always
>> bothered me about the grpcomm/routed implementation is that it is not
>> self contained. There is logic for routing algorithms outside of the
>> components (for example, in orte/orted/orted_comm.c). So, if there are
>> any 

[OMPI devel] opal_condition

2007-12-05 Thread Tim Prins

Hi,

Last night we had one of our threaded builds on the trunk hang when 
running make check on the test opal_condition in test/threads/.

After running the test about 30-40 times, I was only able to get it to 
hang once. Looking at it in gdb, we get:


(gdb) info threads
  3 Thread 1084229984 (LWP 8450)  0x002a95e3bba9 in sched_yield () from /lib64/tls/libc.so.6
  2 Thread 1094719840 (LWP 8451)  0xff600012 in ?? ()
  1 Thread 182904955328 (LWP 8430)  0x002a9567309b in pthread_join () from /lib64/tls/libpthread.so.0
(gdb) thread 2
[Switching to thread 2 (Thread 1094719840 (LWP 8451))]#0  0xff600012 in ?? ()
(gdb) bt
#0  0xff600012 in ?? ()
#1  0x0001 in ?? ()
#2  0x in ?? ()
(gdb) thread 1
[Switching to thread 1 (Thread 182904955328 (LWP 8430))]#0  0x002a9567309b in pthread_join () from /lib64/tls/libpthread.so.0
(gdb) bt
#0  0x002a9567309b in pthread_join () from /lib64/tls/libpthread.so.0
#1  0x002a95794a7d in opal_thread_join () from /san/homedirs/mpiteam/mtt-runs/odin/20071204-Nightly/pb_2/installs/Bp80/src/openmpi-1.3a1r16847/opal/.libs/libopen-pal.so.0
#2  0x00401684 in main ()
(gdb) thread 3
[Switching to thread 3 (Thread 1084229984 (LWP 8450))]#0  0x002a95e3bba9 in sched_yield () from /lib64/tls/libc.so.6
(gdb) bt
#0  0x002a95e3bba9 in sched_yield () from /lib64/tls/libc.so.6
#1  0x00401216 in thr1_run ()
#2  0x002a95672137 in start_thread () from /lib64/tls/libpthread.so.0
#3  0x002a95e53113 in clone () from /lib64/tls/libc.so.6
(gdb)


I know, this is not very helpful, but I have no idea what is going on. 
There have been no changes in this code area for a long time.


Has anyone else seen something like this? Any ideas what is going on?

Thanks,

Tim


[OMPI devel] [PATCH] openib btl: remove excess ompi_btl_openib_connect_base_open call

2007-12-05 Thread Jon Mason
There is a double call to ompi_btl_openib_connect_base_open in
mca_btl_openib_mca_setup_qps().  It looks like someone just forgot to
clean up the previous call when they added the check for the return
code.

I ran a quick IMB test over IB to verify everything is still working.

Thanks,
Jon


Index: ompi/mca/btl/openib/btl_openib_mca.c
===================================================================
--- ompi/mca/btl/openib/btl_openib_mca.c	(revision 16855)
+++ ompi/mca/btl/openib/btl_openib_mca.c	(working copy)
@@ -672,10 +672,7 @@
 mca_btl_openib_component.credits_qp = smallest_pp_qp;

 /* Register any MCA params for the connect pseudo-components */
-
-ompi_btl_openib_connect_base_open();
-
-if ( OMPI_SUCCESS != ompi_btl_openib_connect_base_open())
+if (OMPI_SUCCESS != ompi_btl_openib_connect_base_open())
 goto error;

 ret = OMPI_SUCCESS;


Re: [OMPI devel] IB pow wow notes

2007-12-05 Thread Richard Graham
One question: there is mention of a new PML that is essentially CM+matching.
Why is this not just another instance of CM?

Rich


On 11/26/07 7:54 PM, "Jeff Squyres"  wrote:

> OMPI OF Pow Wow Notes
> 26 Nov 2007
> 
> ---
> 
> Discussion of current / upcoming work:
> 
> OCG (Chelsio):
> - Did a bunch of udapl work, but abandoned it.  Will commit it to a
>tmp branch if anyone cares (likely not).
> - They have been directed to move to the verbs API; will be starting
>on that next week.
> 
> Cisco:
> - likely to get more involved in PML/BTL issues since Galen + Brian
>now transferring out of these areas.
> - race between Chelsio / Cisco as to who implements RDMA CM connect PC
>first (more on this below).  May involve some changes to the connect
>PC interface.
> - Re-working libevent and progress engine stuff with George
> 
> LLNL:
> - Andrew Friedley leaving LLNL in 3 weeks
> - UD code is more or less functional, but working on reliability stuff
>down in the BTL.  That part is not quite working yet.
> - When he leaves LLNL, UD BTL may become unmaintained.
> 
> IBM:
> - Has an interest in NUNAs
> - May be interested in maintaining the UD BTL; worried about scale.
> 
> Mellanox:
> - Just finished first implementation of XRC
> - Now working on QA issues with XRC: testing with multiple subnets,
>different numbers of HCAs/ports on different hosts, etc.
> 
> Sun:
> - Currently working full steam ahead on UDAPL.
> - Will likely join in openib BTL/etc. when Sun's verbs stack is ready.
> - Will *hopefully* get early access to Sun's verbs stack for testing /
>compatibility issues before the stack becomes final.
> 
> ORNL:
> - Mostly working on non-PML/BTL issues these days.
> - Galen's advice: get progress thread working for best IB overlap and
>real application performance.
> 
> Voltaire:
> - Working on XRC improvements
> - Working on message coalescing.  Only sees benefit if you drastically
>reduce the number of available buffers and credits -- i.e., be much
>more like openib BTL before BSRQ (2 buffer sizes: large and small,
>and have very few small buffer credits).  <Need at least an FAQ item to
>explain all the trade-offs.  There could be non-artificial benefits for
>coalescing at scale because of limiting the number of credits>
> - Moving HCA initializing stuff to a common area to share it with
>collective components.
> 
> ---
> 
> Discussion of various "moving forward" proposals:
> 
> - ORNL, Cisco, Mellanox discussing how to reduce cost of memory
>registration.  Currently running some benchmarks to figure out where
>the bottlenecks are.  Cheap registration would *help* (but not
>completely solve) overlap issues by reducing the number of sync
>points -- e.g., always just do one big RDMA operation (vs. the
>pipeline protocol).
> 
> - Some discussion of a UD-based connect PC.  Gleb mentions that what
>was proposed is effectively the same as the IBTA CM (i.e., ibcm).
>Jeff will go investigate.
> 
> - Gleb also mentions that the connect PC needs to be based on the
>endpoint, not the entire module (for non-uniform hardware
>networks).  Jeff took a to-do item to fix.  Probably needs to be
>done for v1.3.
> 
> - To UD or not to UD?  Lots of discussion on this.
> 
>- Some data has been presented by OSU showing that UD drops don't
>  happen often.  Their tests were run in a large non-blocking
>  network.  Some in the group feel that in a busy blocking network,
>  UD drops will be [much] more common (there is at least some
>  evidence that in a busy non-blocking network, drops *are* rare).
>  This issue affects how we design the recovery of UD drops: if
>  drops are rare, recovery can be arbitrarily expensive.  If drops
>  are not rare, recovery needs to be at least somewhat efficient.
>  If drops are frequent, recovery needs to be cheap/fast/easy.
> 
>- Mellanox is investigating why ibv_rc_pingpong gives better
>  bandwidth than ibv_ud_pingpong (i.e., UD bandwidth is poor).
> 
>- Discuss the possibility of doing connection caching: only allow so
>  many RC connections at a time.  Maintain an LRU of RC connections.
>  When you need to close one, also recycle (or free) all of its
>  posted buffers.
> 
>- Discussion of MVAPICH technique for large UD messages: "[receiver]
>  zero copy UD".  Send a match header; receiver picks a UD QP from a
>  ready pool and sends it back to the sender.  Fragments from the
>  user's buffer are posted to that QP on the receiver, so the sender
>  sends straight into the receiver's target buffer.  This scheme
>  assumes no drops.  For OMPI, this scheme also requires more
>  complexity from our current multi-device striping method: we'd
>  want to