One question: there is a mention of a new PML that is essentially CM + matching.
Why is this not just another instance of CM?
Rich
On 11/26/07 7:54 PM, "Jeff Squyres" wrote:
> OMPI OF Pow Wow Notes
> 26 Nov 2007
>
> ---
>
> Discussion of current / upcoming work:
>
> OCG (Chelsio):
> - Did a bunch of udapl work, but abandoned it. Will commit it to a
>tmp branch if anyone cares (likely not).
> - They have been directed to move to the verbs API; will be starting
>on that next week.
>
> Cisco:
> - likely to get more involved in PML/BTL issues since Galen + Brian
>now transferring out of these areas.
> - race between Chelsio / Cisco as to who implements the RDMA CM connect
>   PC first (more on this below). May involve some changes to the connect
>   PC interface. (A minimal client-side sketch of the librdmacm flow
>   follows this list.)
> - Re-working libevent and progress engine stuff with George
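>
> A minimal client-side sketch of the librdmacm connect flow (error
> handling, the listen/accept side, and PD/CQ setup are omitted; "dst",
> "pd", and "qpa" are assumed to be set up elsewhere):
>
>   #include <rdma/rdma_cma.h>
>
>   static int connect_client(struct sockaddr *dst, struct ibv_pd *pd,
>                             struct ibv_qp_init_attr *qpa)
>   {
>       struct rdma_event_channel *ec = rdma_create_event_channel();
>       struct rdma_cm_id *id;
>       struct rdma_cm_event *ev;
>       struct rdma_conn_param cp = { 0 };
>
>       rdma_create_id(ec, &id, NULL, RDMA_PS_TCP);
>       rdma_resolve_addr(id, NULL, dst, 2000 /* ms timeout */);
>       rdma_get_cm_event(ec, &ev);   /* expect ADDR_RESOLVED */
>       rdma_ack_cm_event(ev);
>       rdma_resolve_route(id, 2000);
>       rdma_get_cm_event(ec, &ev);   /* expect ROUTE_RESOLVED */
>       rdma_ack_cm_event(ev);
>       rdma_create_qp(id, pd, qpa);  /* the CM drives the QP transitions */
>       rdma_connect(id, &cp);
>       rdma_get_cm_event(ec, &ev);   /* expect ESTABLISHED */
>       rdma_ack_cm_event(ev);
>       return 0;
>   }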
>
> LLNL:
> - Andrew Friedley leaving LLNL in 3 weeks
> - UD code more or less functional, but working on reliability stuff
>down in the BTL. That part is not quite working yet.
> - When he leaves LLNL, UD BTL may become unmaintained.
>
> IBM:
> - Has an interest in NUNAs (non-uniform network architectures)
> - May be interested in maintaining the UD BTL; worried about scale.
>
> Mellanox:
> - Just finished first implementation of XRC
> - Now working on QA issues with XRC: testing with multiple subnets,
>different numbers of HCAs/ports on different hosts, etc.
>
> Sun:
> - Currently working full steam ahead on UDAPL.
> - Will likely join in openib BTL/etc. when Sun's verbs stack is ready.
> - Will *hopefully* get early access to Sun's verbs stack for testing /
>compatibility issues before the stack becomes final.
>
> ORNL:
> - Mostly working on non-PML/BTL issues these days.
> - Galen's advice: get progress thread working for best IB overlap and
>real application performance.
>
> Voltaire:
> - Working on XRC improvements
> - Working on message coalescing. Only sees benefit if you drastically
>   reduce the number of available buffers and credits -- i.e., be much
>   more like the openib BTL before BSRQ (2 buffer sizes, large and
>   small, with very few small-buffer credits). This deserves at least
>   an FAQ item to explain all the trade-offs. There could be
>   non-artificial benefits for coalescing at scale because of limiting
>   the number of credits. (A packing sketch follows this list.)
> - Moving HCA initialization code to a common area to share it with
>   collective components.
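>
> A sketch of the coalescing idea (hypothetical names, not the actual
> BTL code): pack queued small fragments back-to-back into one
> pre-registered buffer, so the whole batch costs a single send credit:
>
>   #include <string.h>
>
>   struct frag { const char *data; size_t len; };
>
>   /* Returns bytes packed; the caller posts one send of that size. */
>   static size_t coalesce(struct frag *frags, int nfrags,
>                          char *buf, size_t cap)
>   {
>       size_t used = 0;
>       for (int i = 0; i < nfrags && used + frags[i].len <= cap; i++) {
>           memcpy(buf + used, frags[i].data, frags[i].len);
>           used += frags[i].len;
>       }
>       return used;
>   }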
>
> ---
>
> Discussion of various "moving forward" proposals:
>
> - ORNL, Cisco, Mellanox discussing how to reduce cost of memory
>registration. Currently running some benchmarks to figure out where
>the bottlenecks are. Cheap registration would *help* (but not
>completely solve) overlap issues by reducing the number of sync
>points -- e.g., always just do one big RDMA operation (vs. the
>pipeline protocol).
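>
> The expensive part is the verb itself: ibv_reg_mr() pins the pages
> and programs the HCA's address-translation tables. A sketch of how a
> registration cache amortizes that cost (cache_find() / cache_insert()
> are hypothetical):
>
>   #include <infiniband/verbs.h>
>
>   /* hypothetical cache, keyed on (addr, len) */
>   struct ibv_mr *cache_find(void *buf, size_t len);
>   void cache_insert(void *buf, size_t len, struct ibv_mr *mr);
>
>   static struct ibv_mr *get_mr(struct ibv_pd *pd, void *buf, size_t len)
>   {
>       struct ibv_mr *mr = cache_find(buf, len);
>       if (mr == NULL) {
>           mr = ibv_reg_mr(pd, buf, len,
>                           IBV_ACCESS_LOCAL_WRITE |
>                           IBV_ACCESS_REMOTE_READ |
>                           IBV_ACCESS_REMOTE_WRITE);
>           cache_insert(buf, len, mr);   /* pay registration cost once */
>       }
>       return mr;
>   }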
>
> - Some discussion of a UD-based connect PC. Gleb mentions that what
>was proposed is effectively the same as the IBTA CM (i.e., ibcm).
>Jeff will go investigate.
>
> - Gleb also mentions that the connect PC needs to be based on the
>endpoint, not the entire module (for non-uniform hardware
>networks). Jeff took a to-do item to fix. Probably needs to be
>done for v1.3.
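>
> A hypothetical sketch of what per-endpoint selection could look like
> (invented names, not actual OMPI code):
>
>   struct endpoint;                                /* one remote peer */
>
>   struct connect_method {
>       const char *name;                           /* e.g. "oob", "rdma_cm" */
>       int (*start_connect)(struct endpoint *ep);  /* keyed on the endpoint... */
>   };
>
>   struct endpoint {
>       struct connect_method *method;              /* ...not the whole module */
>       /* peer address info, QP state, ... */
>   };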
>
> - To UD or not to UD? Lots of discussion on this.
>
>- Some data has been presented by OSU showing that UD drops don't
> happen often. Their tests were run in a large non-blocking
> network. Some in the group feel that in a busy blocking network,
> UD drops will be [much] more common (there is at least some
> evidence that in a busy non-blocking network, drops *are* rare).
> This issue affects how we design the recovery of UD drops: if
> drops are rare, recovery can be arbitrarily expensive. If drops
> are not rare, recovery needs to be at least somewhat efficient.
> If drops are frequent, recovery needs to be cheap/fast/easy.
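>
> One way to read the trade-off (hypothetical sketch, not MVAPICH or
> OMPI code): with per-fragment sequence numbers and cumulative acks,
> rare drops let the recovery path be a blunt go-back-N resend, because
> it almost never runs:
>
>   #include <stdint.h>
>
>   struct ud_rel_hdr {
>       uint32_t seq;   /* sender's fragment sequence number */
>       uint32_t ack;   /* highest contiguous seq seen from the peer */
>   };
>
>   void resend_fragment(uint32_t seq);   /* hypothetical: repost from
>                                            the retransmit queue */
>
>   /* On ack timeout, resend everything not yet acked (go-back-N). */
>   static void on_ack_timeout(uint32_t last_acked, uint32_t next_seq)
>   {
>       for (uint32_t s = last_acked + 1; s != next_seq; s++)
>           resend_fragment(s);
>   }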
>
>- Mellanox is investigating why ibv_rc_pingpong gives better
> bandwidth than ibv_ud_pingpong (i.e., UD bandwidth is poor).
>
>- Discuss the possibility of doing connection caching: only allow so
> many RC connections at a time. Maintain an LRU of RC connections.
> When you need to close one, also recycle (or free) all of its
> posted buffers.
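>
> A hypothetical sketch of the LRU idea (helper names invented):
>
>   #define MAX_RC_CONNS 64
>
>   struct rc_conn { struct rc_conn *prev, *next; /* QP, buffers, ... */ };
>
>   /* hypothetical intrusive-list and teardown helpers */
>   void lru_unlink(struct rc_conn *c);
>   void lru_push_head(struct rc_conn *c);
>   struct rc_conn *lru_tail(void);
>   void recycle_buffers(struct rc_conn *c);
>   void destroy_qp(struct rc_conn *c);
>
>   static int nconns;
>
>   /* Opening a new RC connection: evict the coldest one if at the
>    * cap, recycling its posted buffers along the way. */
>   static void conn_open(struct rc_conn *c)
>   {
>       if (nconns == MAX_RC_CONNS) {
>           struct rc_conn *victim = lru_tail();
>           lru_unlink(victim);
>           recycle_buffers(victim);
>           destroy_qp(victim);
>           nconns--;
>       }
>       lru_push_head(c);
>       nconns++;
>   }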
>
>- Discussion of MVAPICH technique for large UD messages: "[receiver]
> zero copy UD". Send a match header; receiver picks a UD QP from a
> ready pool and sends it back to the sender. Fragments from the
> user's buffer are posted to that QP on the receiver, so the sender
> sends straight into the receiver's target buffer. This scheme
> assumes no drops. For OMPI, this scheme also requires more
> complexity from our current multi-device striping method: we'd
> want to