Re: [OMPI devel] [ofa-general] uDAPL EVD queue length issue
Jon Mason wrote:
> While working on the OMPI udapl btl, I have noticed some "interesting"
> behavior. OFA udapl wants the evd queues to be a power of 2 and then
> will subtract 1 for bookkeeping (i.e., so that the internal head and
> tail pointers never touch except when the ring is empty). OFA udapl
> will report the queue length as this number (and not the original size
> requested) when queried. This becomes interesting when a power of 2 is
> passed in and then queried. For example, a requested queue of length
> 256 will report a length of 255 when queried. Something is not right.
> You should ALWAYS get at least what you request.

On my system with an mthca, a request of 256 gets you 511. It is the verbs provider that is rounding up, not uDAPL. Here is my uDAPL debug output (DAPL_DBG_TYPE=0x) using dtest:

  cq_object_create: (0x519bb0,0x519d00)
  dapls_ib_cq_alloc: evd 0x519bb0 cqlen=256
  dapls_ib_cq_alloc: new_cq 0x519d60 cqlen=511

This is before and after the ibv_create_cq call. uDAPL builds its EVD resources based on what is returned from this call. I modified dtest to double-check the dat_evd_query and I get the same:

  8962 dto_rcv_evd created 0x519e80
  8962 dto_req_evd QLEN - requested 256 and actual 511

What OFED release and device are you using?

-arlin
Re: [OMPI devel] MPI_GROUP_EMPTY and MPI_Group_free()
On Dec 4, 2007, at 10:43 AM, Lisandro Dalcin wrote:
> * MPI_GROUP_EMPTY cannot be freed, as it is a predefined handle. Users
> have to always check if the result of a group operation is
> MPI_GROUP_EMPTY to know if they can or cannot free them. This way is
> similar to the current management of predefined datatypes.

I'd be in favor of this, since it's consistent with the rest of the spec w.r.t. predefined handles.

--
Jeff Squyres
Cisco Systems
[OMPI devel] RTE Issue IV: RTE/MPI relative modex responsibilities
IV. RTE/MPI relative modex responsibilities

The modex operation conducted during MPI_Init currently involves the exchange of two critical pieces of information:

1. the location (i.e., node) of each process in my job, so I can determine who shares a node with me. This is subsequently used by the shared memory subsystem for initialization and message routing; and

2. BTL contact info for each process in my job.

During our recent efforts to further abstract the RTE from the MPI layer, we pushed responsibility for both pieces of information into the MPI layer. This wasn't done capriciously - the modex has always included the exchange of both pieces of information, and we chose not to disturb that situation.

However, the mixing of these two functional requirements does cause problems when dealing with an environment such as the Cray, where BTL information is "exchanged" via an entirely different mechanism. In addition, it has been noted that the RTE (and not the MPI layer) actually "knows" the node location for each process. Hence, questions have been raised as to whether:

(a) the modex should be built into a framework to allow multiple BTL exchange mechanisms to be supported, or some alternative mechanism be used - one suggestion made was to implement an MPICH-like attribute exchange; and

(b) the RTE should absorb responsibility for providing a "node map" of the processes in a job (note: the modex may -use- that info, but would no longer be required to exchange it). This has a number of implications that need to be carefully considered - e.g., the memory required to store the node map in every process is non-zero. On the other hand:

(i) every proc already -does- store the node for every proc - it is simply stored in the ompi_proc_t structures as opposed to somewhere in the RTE. We would want to avoid duplicating that storage, but there would be no change in memory footprint if done carefully.
(ii) every daemon already knows the node map for the job, so communicating that info to its local procs may not prove a major burden. However, the very environments where this subject may be an issue (e.g., the Cray) do not use our daemons, so some alternative mechanism for obtaining the info would be required.

So the questions to be considered here are:

(a) do we leave the current modex "as-is", to include exchange of the node map, perhaps including "#if" statements to support different exchange mechanisms?

(b) do we separate the two functions currently in the modex and push the requirement to obtain a node map into the RTE? If so, how do we want the MPI layer to retrieve that info so we avoid increasing our memory footprint?

(c) do we create a separate modex framework for handling the different exchange mechanisms for BTL info, incorporate it into an existing framework (if so, which one - perhaps the new publish-subscribe framework?), implement an alternative approach, or...?

(d) other suggestions?

Ralph
[OMPI devel] MPI_GROUP_EMPTY and MPI_Group_free()
Dear all,

As I see some activity on a related ticket, below are some comments I sent to Bill Gropp some days ago about this subject. Bill did not write me back; I know he is really busy.

Group operations are supposed to return new groups, so the user has to free the result. Additionally, the standard says that those operations may return the empty group. Hence the issue: if the empty group is returned, should the user call MPI_Group_free() or not? I could not find any part of the standard about freeing MPI_GROUP_EMPTY. This issue is very similar to the one in MPI-1 related to error handlers. I believe the standard should be a bit stricter here; some possibilities are:

* MPI_GROUP_EMPTY must be freed if it is the result of a group operation. This way is similar to the management of predefined error handlers.

* MPI_GROUP_EMPTY cannot be freed, as it is a predefined handle. Users have to always check if the result of a group operation is MPI_GROUP_EMPTY to know if they can or cannot free them. This way is similar to the current management of predefined datatypes.

--
Lisandro Dalcín
---
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594
[OMPI devel] RTE Issue II: Interaction between the ROUTED and GRPCOMM frameworks
II. Interaction between the ROUTED and GRPCOMM frameworks

When we initially developed these two frameworks within the RTE, we envisioned them operating totally independently of each other. Thus, the grpcomm collectives provide algorithms such as a binomial "xcast" that uses the daemons to scalably send messages across the system.

However, we recently realized that the efficacy of the current grpcomm algorithms hinges directly on the daemons being fully connected - which we were recently told may not be the case as other people introduce different ROUTED components. For example, using the binomial algorithm in grpcomm's xcast while having a ring topology selected in ROUTED would likely result in terrible performance.

This raises the following questions:

(a) should the GRPCOMM and ROUTED frameworks be consolidated to ensure that the group collectives algorithms properly "match" the communication topology?

(b) should we automatically select the grpcomm/routed pairings based on some internal logic?

(c) should we leave this "as-is", making the user responsible for making intelligent choices (and for detecting when performance is bad due to this mismatch)?

(d) other suggestions?

Ralph
[OMPI devel] RTE issues: scalability & complexity
Yo all

As (I hope) many of you know, we are in a final phase of revamping ORTE to simplify the code, enhance scalability, and improve reliability. In working on this effort, we recently uncovered four issues that merit broader discussion (apologies in advance for verbosity). Although these are somewhat related, I realize that people may care about some and not others. Hence, to provide the chance to comment only on those you -do- care about, and to at least somewhat constrain the length of the emails, I will be sending out a series of four emails in this area. The issues will include:

I. Support for non-MPI jobs
II. Interaction between the ROUTED and GRPCOMM frameworks
III. Collective communications across daemons
IV. RTE/MPI relative modex responsibilities

Please feel free to contact me and/or comment on any of these issues. As a reminder, if you do comment back to the devel mailing list, please use "reply all" so I will also receive a copy of the message.

Thanks
Ralph