Yo folks We discussed this a bit at last week’s developer telecon, and so I’m attempting to capture the options/plans as they were discussed so that others may chime in with suggestions.
Several new capabilities have been added to OMPI in recent months, all focused on exascale operations. Together, they combine to provide a reduced memory footprint and a potentially very fast launch time. They fall into two areas: 1. asynchronous addition of the ompi_proc_t structure to reduce memory footprint. We previously would allocate a struct for every process in the job at startup, even though most applications would only actually communicate to a small subset of their peers. This has been changed to allow allocation of the structs upon first message, thus meaning that you only use memory for those peers with which you actually communicate. Note that we do still create structs for *all* dynamically spawned processes (i.e., procs spawned via MPI_Comm_spawn) at time of launch. This option is controlled by the MCA param “mpi_add_procs_cutoff”. Jobs that have #procs < the cutoff will continue to create ompi_proc_t’s for every process during startup. If #procs > cutoff, then you’ll use the async addition method. 2. removal of barrier operations during MPI_Init. There currently are two barriers in MPI_Init - the first during the allgather collective which returns data posted by each process (the infamous “modex” operation), and the second right before completing MPI_Init that is used to ensure that all peers are ready to communicate. The modex operation is used to collect endpoint info from every other process in the job. As we have pointed out, most of that info is already known by the host RM, and we are working with the RM community to have them provide it - this would reduce the size of the modex message by roughly 90%. The remaining modex info provides endpoint info for each of the available transports. Many of our transports do not require this exchange as they operate on endpoints that can be computed based on knowledge of the other proc’s location and relative rank on that location (hostname and our node_rank). In those cases, we can just drop thru the modex. This is controlled by setting the MCA param “pmix_base_async_modex=1”. If this param is set, then any info that is not provided at startup, but is subsequently requested by a proc, will be retrieved via the “direct modex” operation - i.e., a request for the target proc’s data will be sent to the daemon hosting that proc, and the data retrieved and delivered to the requestor. The barrier at the end of MPI_Init is required to ensure that all procs are indeed ready to receive communications prior to allowing any proc to send a message. Some (if not most) of our transports don’t have an “ack” or connection mechanism to detect if the other side was able to receive a message. Thus, in the absence of a barrier, the possibility exists for a proc to send a message before the other side has fully prepared to recv it - thereby resulting in the undetected loss of the message. So we currently always execute this barrier. We also have a barrier during MPI_Finalize to ensure that all MPI messages have been handled prior to exiting. No discussion about potentially removing or making that one optional have been held as it fills a similar requirement as the barrier at the end of MPI_Init. There were two subjects of discussion: 1. the relationship, if any, between the two new MCA params. The async add_procs does not appear to have any major performance impact, though that remains to be fully proven. The async_modex is expected to help in all cases where the endpoints for all active transports are computed, regardless of connection topology. For other cases (i.e., non-computed endpoints), async modex will help for sparsely connected topologies, and hurt for densely connected topologies (e.g., those commonly found for OSHMEM apps). Conclusion: we’ll leave these as separate params for now as the linkage appears weak. 2. what to do about the other barriers. These are mostly driven by the characteristics of the specific active transports. Some can, or may someday be able to, support removal of one or both of the barriers. Accordingly, we decided to add flag(s) to the PML-BTL interface to indicate if barrier support is required prior to first message, and after last message. If any active BTL indicates it needs such support, then the corresponding barrier will be executed. Conclusion: Ralph will create two global variables to control barrier execution, and add the required “if-then” statements. Nathan will add the PML-BTL flags. I hope that accurately captured the intent of the participants. Please feel free to comment and/or correct Ralph