Re: [OMPI devel] IB pow wow notes

2007-12-06 Thread Jeff Squyres

On Dec 2, 2007, at 5:11 PM, Richard Graham wrote:

One question -- there is a mention of a new PML that is essentially CM
+ matching.

Why is this not just another instance of CM?


I'm not sure I understand your question -- the proposed new PML would  
be different from CM: it would have matching and support more than one  
underlying device (e.g., more than one MTL).


Could this just be CM with some run-time parameter enabled?   
Possibly.  Is it worth it?  I'm not sure -- CM is nice in that it's so  
small / simple.  Do we really want to make it more complex?


All of this is speculation / vaporware at the moment anyway -- just  
tossing around some ideas...


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] IB pow wow notes

2007-12-05 Thread Richard Graham
One question -- there is a mention of a new PML that is essentially CM + matching.
Why is this not just another instance of CM?

Rich


On 11/26/07 7:54 PM, "Jeff Squyres"  wrote:

> OMPI OF Pow Wow Notes
> 26 Nov 2007
> 
> ---
> 
> Discussion of current / upcoming work:
> 
> OCG (Chelsio):
> - Did a bunch of udapl work, but abandoned it.  Will commit it to a
>tmp branch if anyone cares (likely not).
> - They have been directed to move to the verbs API; will be starting
>on that next week.
> 
> Cisco:
> - likely to get more involved in PML/BTL issues since Galen + Brian
>now transferring out of these areas.
> - race between Chelsio / Cisco as to who implements RDMA CM connect PC
>first (more on this below).  May involve some changes to the connect
>PC interface.
> - Re-working libevent and progress engine stuff with George
> 
> LLNL:
> - Andrew Friedley leaving LLNL in 3 weeks
> - UD code more or less functional, but working on reliability stuff
>down in the BTL.  That part is not quite working yet.
> - When he leaves LLNL, UD BTL may become unmaintained.
> 
> IBM:
> - Has an interest in NUNAs
> - May be interested in maintaining the UD BTL; worried about scale.
> 
> Mellanox:
> - Just finished first implementation of XRC
> - Now working on QA issues with XRC: testing with multiple subnets,
>different numbers of HCAs/ports on different hosts, etc.
> 
> Sun:
> - Currently working full steam ahead on UDAPL.
> - Will likely join in openib BTL/etc. when Sun's verbs stack is ready.
> - Will *hopefully* get early access to Sun's verbs stack for testing /
>compatibility issues before the stack becomes final.
> 
> ORNL:
> - Mostly working on non-PML/BTL issues these days.
> - Galen's advice: get progress thread working for best IB overlap and
>real application performance.
> 
> Voltaire:
> - Working on XRC improvements
> - Working on message coalescing.  Only sees benefit if you drastically
>reduce the number of available buffers and credits -- i.e., be much
>more like the openib BTL before BSRQ (2 buffer sizes: large and small,
>and have very few small buffer credits).  Probably warrants at
>least an FAQ item to explain all the trade-offs.  There could be
>non-artificial benefits for coalescing at scale because of limiting
>the number of credits.
> - Moving HCA initializing stuff to a common area to share it with
>collective components.
> 
> ---
> 
> Discussion of various "moving forward" proposals:
> 
> - ORNL, Cisco, Mellanox discussing how to reduce cost of memory
>registration.  Currently running some benchmarks to figure out where
>the bottlenecks are.  Cheap registration would *help* (but not
>completely solve) overlap issues by reducing the number of sync
>points -- e.g., always just do one big RDMA operation (vs. the
>pipeline protocol).
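The sync-point argument above can be made concrete with a toy model (illustrative only, not OMPI code; the one-rendezvous-per-chunk assumption is a simplification of the pipeline protocol):

```python
# Toy model of why cheap registration reduces sync points.
# Assumption: each registered chunk requires one sender/receiver
# rendezvous ("sync point") before its RDMA operation can start.

def pipeline_sync_points(msg_bytes, chunk_bytes):
    """Pipeline protocol: register and RDMA one chunk at a time."""
    chunks = -(-msg_bytes // chunk_bytes)  # ceiling division
    return chunks                          # one rendezvous per chunk

def single_rdma_sync_points(msg_bytes):
    """Register the whole buffer once, then one big RDMA operation."""
    return 1

if __name__ == "__main__":
    msg = 64 * 1024 * 1024       # 64 MB message
    chunk = 1 * 1024 * 1024      # 1 MB pipeline fragments
    print(pipeline_sync_points(msg, chunk))   # 64 sync points
    print(single_rdma_sync_points(msg))       # 1 sync point
```

With cheap registration the single-RDMA path wins outright; with expensive registration the pipeline hides registration cost behind transfers, at the price of many more sync points that hurt overlap.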
> 
> - Some discussion of a UD-based connect PC.  Gleb mentions that what
>was proposed is effectively the same as the IBTA CM (i.e., ibcm).
>Jeff will go investigate.
> 
> - Gleb also mentions that the connect PC needs to be based on the
>endpoint, not the entire module (for non-uniform hardware
>networks).  Jeff took a to-do item to fix.  Probably needs to be
>done for v1.3.
> 
> - To UD or not to UD?  Lots of discussion on this.
> 
>- Some data has been presented by OSU showing that UD drops don't
>  happen often.  Their tests were run in a large non-blocking
>  network.  Some in the group feel that in a busy blocking network,
>  UD drops will be [much] more common (there is at least some
>  evidence that in a busy non-blocking network, drops *are* rare).
>  This issue affects how we design the recovery of UD drops: if
>  drops are rare, recovery can be arbitrarily expensive.  If drops
>  are not rare, recovery needs to be at least somewhat efficient.
>  If drops are frequent, recovery needs to be cheap/fast/easy.
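The cost trade-off can be sketched with a minimal ack/retransmit scheme (illustrative only, not the openib BTL's actual reliability layer; ack losses are folded into the per-fragment drop probability):

```python
import random

# Minimal reliability sketch over an unreliable (UD-like) channel:
# every fragment carries a sequence number, and anything unacked is
# resent.  Total sends show how recovery cost grows with drop rate.

def deliver_all(num_frags, drop_prob, rng):
    """Return total sends needed to deliver num_frags fragments."""
    pending = set(range(num_frags))
    sends = 0
    while pending:
        for seq in sorted(pending):
            sends += 1
            if rng.random() >= drop_prob:  # fragment + ack both arrive
                pending.discard(seq)
    return sends

if __name__ == "__main__":
    rng = random.Random(0)
    print(deliver_all(1000, 0.0, rng))   # 1000: no drops, no retransmits
    print(deliver_all(1000, 0.3, rng))   # well over 1000: frequent drops
```

If drops are rare, the retransmit path is almost never taken and can be slow; if drops are common, the extra sends dominate and the recovery path must be fast.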
> 
>- Mellanox is investigating why ibv_rc_pingpong gives better
>  bandwidth than ibv_ud_pingpong (i.e., UD bandwidth is poor).
> 
>- Discuss the possibility of doing connection caching: only allow so
>  many RC connections at a time.  Maintain an LRU of RC connections.
>  When you need to close one, also recycle (or free) all of its
>  posted buffers.
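The caching idea above might look something like this sketch (hypothetical class and callback names, not OMPI code; the callbacks stand in for QP setup/teardown and buffer recycling):

```python
from collections import OrderedDict

# Cap the number of open RC connections, track them in LRU order, and
# when the cap is hit close the least-recently-used connection and
# recycle (or free) its posted buffers via a callback.

class RcConnectionCache:
    def __init__(self, max_conns, close_cb):
        self.max_conns = max_conns
        self.close_cb = close_cb       # closes QP, recycles posted buffers
        self.conns = OrderedDict()     # peer -> connection object

    def get(self, peer, connect_cb):
        """Return a connection to peer, evicting the LRU one if full."""
        if peer in self.conns:
            self.conns.move_to_end(peer)   # mark most recently used
            return self.conns[peer]
        if len(self.conns) >= self.max_conns:
            victim, conn = self.conns.popitem(last=False)  # LRU entry
            self.close_cb(victim, conn)
        self.conns[peer] = connect_cb(peer)
        return self.conns[peer]
```

For example, with `max_conns=2`, connecting to peers a, b, then c closes a's connection (and would recycle its buffers) before opening c's.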
> 
>- Discussion of MVAPICH technique for large UD messages: "[receiver]
>  zero copy UD".  Send a match header; receiver picks a UD QP from a
>  ready pool and sends it back to the sender.  Fragments from the
>  user's buffer are posted to that QP on the receiver, so the sender
>  sends straight into the receiver's target buffer.  This scheme
>  assumes no drops.  For OMPI, this scheme also requires more
>  complexity from our current multi-device striping method: we'd
>  want to