Re: [OMPI devel] RTE Issue IV: RTE/MPI relative modex responsibilities

2007-12-06 Thread Ralph H Castain



On 12/6/07 8:06 AM, "Shipman, Galen M." wrote:

>>> 
>>> Do we really need a complete node map? As far as I can tell, it looks
>>> like the MPI layer only needs a list of local processes. So maybe it
>>> would be better to forget about the node ids at the mpi layer and just
>>> return the local procs.
>> 
>> I agree, though I don't think we want a parallel list of procs. We just need
>> to set the "local" flag in the existing ompi_proc_t structures.
>> 
> 
> Having a parallel list of procs makes perfect sense. That way ORTE can store
> ORTE information in the orte_proc_t and OMPI can store OMPI information in
> the ompi_proc_t. The ompi_proc_t could either "inherit" the orte_proc_t or
> have a pointer to it so that we have no duplication of data.
> 

Hmmm...well, I personally don't have an opinion either way regarding how
the info flows up to the MPI layer, so I'll leave that up to you MPI folks.
I can certainly create such a list if that's the way you want to go.
However, since I have no way of knowing that a job is an MPI-job or not, I
will point out that non-MPI jobs will see an increased memory footprint as a
result. Not sure we care a whole lot - depends upon what other uses people
are trying to make of ORTE (e.g., STCI).

I will point out that getting the complete node map to every process will
incur some penalty in terms of launch time, and still begs the question of
how it gets there. I can provide it when launching via the ORTE daemons, of
course (would have to send it via a daemon-to-local-proc RML message,
though, as it would be too large to put in enviro variables) - I assume
there must be a parallel mechanism for something like the Cray to provide
it?

I'm also not sure how something like SLURM or TM would provide this if/when
we tightly integrate (i.e., go through their daemons instead of ORTE's) -
could be a future issue.


> Having a global map makes sense, particularly for numerous communication
> scenarios, if I know all the processes are on the same node I may send a
> message to the lowest "vpid" on that node and he could then forward to
> everyone else.

Are you talking about RML communications? If so, we already have that in
place via the daemons.

Or are you talking about routing BTL/MPI messages?? I thought that was a
"no-no"?

> 
> 
>> One option is for the RTE to just pass in an enviro variable with a
>> comma-separated list of your local ranks, but that creates a problem down
>> the road when trying to integrate tighter with systems like SLURM where the
>> procs would get mass-launched (so the environment has to be the same for all
>> of them).
>> 
> Having a enviro variable with at comma-seperated list of local ranks doesn't
> seems like a bit of a hack to me.
> 
>>> 
>>> So my vote would be to leave the modex alone, but remove the node id,
>>> and add a function to get the list of local procs. It doesn't matter to
>>> me how the RTE implements that.
>> 
>> I think we would need to be careful here that we don't create a need for
>> more communication. We have two functions currently in the modex:
>> 
>> 1. how to exchange the info required to populate the ompi_proc_t structures;
>> and
>> 
>> 2. how to identify which of those procs are "local"
>> 
>> The problem with leaving the modex as it currently sits is that some
>> environments require a different mechanism for exchanging the ompi_proc_t
>> info. While most can use the RML, some can't. The same division of
>> capabilities applies to getting the "local" info, so it makes sense to me to
>> put the modex in a framework.
>> 
>> Otherwise, we wind up with a bunch of #if's in the code to support
>> environments like the Cray. I believe the mca system was put in place
>> precisely to avoid those kinds of practices, so it makes sense to me to take
>> advantage of it.
>> 
>> 
>>> 
>>> Alternatively, if we did a process attribute system we could just use
>>> predefined attributes, and the runtime can get each process's node id
>>> however it wants.
>> 
>> Same problem as above, isn't it? Probably ignorance on my part, but it seems
>> to me that we simply exchange a modex framework for an attribute framework
>> (since each environment would have to get the attribute values in a
>> different manner) - don't we?
>> 
>> I have no problem with using attributes instead of the modex, but the issue
>> appears to be the same either way - you still need a framework to handle the
>> different methods.
>> 
>> 
>> Ralph
>> 
>>> 
>>> Tim
>>> 
>>> Ralph H Castain wrote:
>>>> IV. RTE/MPI relative modex responsibilities
>>>> The modex operation conducted during MPI_Init currently involves the
>>>> exchange of two critical pieces of information:
>>>>
>>>> 1. the location (i.e., node) of each process in my job so I can determine
>>>> who shares a node with me. This is subsequently used by the shared memory
>>>> subsystem for initialization and message routing; and
>>>>
>>>> 2. BTL contact info for each process in my job.
>>>>

Re: [OMPI devel] RTE Issue IV: RTE/MPI relative modex responsibilities

2007-12-06 Thread Ralph H Castain



On 12/6/07 8:09 AM, "Shipman, Galen M." wrote:

> Sorry, to be clear that should have been:
> 
>> One option is for the RTE to just pass in an enviro variable with a
>> comma-separated list of your local ranks, but that creates a problem down
>> the road when trying to integrate tighter with systems like SLURM where the
>> procs would get mass-launched (so the environment has to be the same for all
>> of them).
>> 
> Having an enviro variable with a comma-separated list of local ranks seems
> like a bit of a hack to me.

No argument - just trying to offer options for consideration. Not advocating
any of them yet. I'm still hoping for the "perfect solution" to show itself,
but I personally expect an acceptable compromise is the most likely
scenario.


> 
>>> 
>>> So my vote would be to leave the modex alone, but remove the node id,
>>> and add a function to get the list of local procs. It doesn't matter to
>>> me how the RTE implements that.
>> 
>> I think we would need to be careful here that we don't create a need for
>> more communication. We have two functions currently in the modex:
>> 
>> 1. how to exchange the info required to populate the ompi_proc_t structures;
>> and
>> 
>> 2. how to identify which of those procs are "local"
>> 
>> The problem with leaving the modex as it currently sits is that some
>> environments require a different mechanism for exchanging the ompi_proc_t
>> info. While most can use the RML, some can't. The same division of
>> capabilities applies to getting the "local" info, so it makes sense to me to
>> put the modex in a framework.
>> 
>> Otherwise, we wind up with a bunch of #if's in the code to support
>> environments like the Cray. I believe the mca system was put in place
>> precisely to avoid those kinds of practices, so it makes sense to me to take
>> advantage of it.
>> 
>> 
>>> 
>>> Alternatively, if we did a process attribute system we could just use
>>> predefined attributes, and the runtime can get each process's node id
>>> however it wants.
>> 
>> Same problem as above, isn't it? Probably ignorance on my part, but it seems
>> to me that we simply exchange a modex framework for an attribute framework
>> (since each environment would have to get the attribute values in a
>> different manner) - don't we?
>> 
>> I have no problem with using attributes instead of the modex, but the issue
>> appears to be the same either way - you still need a framework to handle the
>> different methods.
>> 
>> 
>> Ralph
>> 
>>> 
>>> Tim
>>> 
>>> Ralph H Castain wrote:
>>>> IV. RTE/MPI relative modex responsibilities
>>>> The modex operation conducted during MPI_Init currently involves the
>>>> exchange of two critical pieces of information:
>>>>
>>>> 1. the location (i.e., node) of each process in my job so I can determine
>>>> who shares a node with me. This is subsequently used by the shared memory
>>>> subsystem for initialization and message routing; and
>>>>
>>>> 2. BTL contact info for each process in my job.
>>>>
>>>> During our recent efforts to further abstract the RTE from the MPI layer, we
>>>> pushed responsibility for both pieces of information into the MPI layer.
>>>> This wasn't done capriciously - the modex has always included the exchange
>>>> of both pieces of information, and we chose not to disturb that situation.
>>>>
>>>> However, the mixing of these two functional requirements does cause problems
>>>> when dealing with an environment such as the Cray where BTL information is
>>>> "exchanged" via an entirely different mechanism. In addition, it has been
>>>> noted that the RTE (and not the MPI layer) actually "knows" the node
>>>> location for each process.
>>>>
>>>> Hence, questions have been raised as to whether:
>>>>
>>>> (a) the modex should be built into a framework to allow multiple BTL
>>>> exchange mechanisms to be supported, or some alternative mechanism be used -
>>>> one suggestion made was to implement an MPICH-like attribute exchange; and
>>>>
>>>> (b) the RTE should absorb responsibility for providing a "node map" of the
>>>> processes in a job (note: the modex may -use- that info, but would no longer
>>>> be required to exchange it). This has a number of implications that need to
>>>> be carefully considered - e.g., the memory required to store the node map in
>>>> every process is non-zero. On the other hand:
>>>>
>>>> (i) every proc already -does- store the node for every proc - it is simply
>>>> stored in the ompi_proc_t structures as opposed to somewhere in the RTE. We
>>>> would want to avoid duplicating that storage, but there would be no change
>>>> in memory footprint if done carefully.
>>>>
>>>> (ii) every daemon already knows the node map for the job, so communicating
>>>> that info to its local procs may not prove a major burden. However, the very
>>>> environments where this subject may be an issue (e.g., the Cray) do not use
>>>> our daemons, so some alternative mechanism

Re: [OMPI devel] RTE Issue IV: RTE/MPI relative modex responsibilities

2007-12-06 Thread Shipman, Galen M.
Sorry, to be clear that should have been:

> One option is for the RTE to just pass in an enviro variable with a
> comma-separated list of your local ranks, but that creates a problem down
> the road when trying to integrate tighter with systems like SLURM where the
> procs would get mass-launched (so the environment has to be the same for all
> of them).
> 
Having an enviro variable with a comma-separated list of local ranks seems
like a bit of a hack to me.

>> 
>> So my vote would be to leave the modex alone, but remove the node id,
>> and add a function to get the list of local procs. It doesn't matter to
>> me how the RTE implements that.
> 
> I think we would need to be careful here that we don't create a need for
> more communication. We have two functions currently in the modex:
> 
> 1. how to exchange the info required to populate the ompi_proc_t structures;
> and
> 
> 2. how to identify which of those procs are "local"
> 
> The problem with leaving the modex as it currently sits is that some
> environments require a different mechanism for exchanging the ompi_proc_t
> info. While most can use the RML, some can't. The same division of
> capabilities applies to getting the "local" info, so it makes sense to me to
> put the modex in a framework.
> 
> Otherwise, we wind up with a bunch of #if's in the code to support
> environments like the Cray. I believe the mca system was put in place
> precisely to avoid those kinds of practices, so it makes sense to me to take
> advantage of it.
> 
> 
>> 
>> Alternatively, if we did a process attribute system we could just use
>> predefined attributes, and the runtime can get each process's node id
>> however it wants.
> 
> Same problem as above, isn't it? Probably ignorance on my part, but it seems
> to me that we simply exchange a modex framework for an attribute framework
> (since each environment would have to get the attribute values in a
> different manner) - don't we?
> 
> I have no problem with using attributes instead of the modex, but the issue
> appears to be the same either way - you still need a framework to handle the
> different methods.
> 
> 
> Ralph
> 
>> 
>> Tim
>> 
>> Ralph H Castain wrote:
>>> IV. RTE/MPI relative modex responsibilities
>>> The modex operation conducted during MPI_Init currently involves the
>>> exchange of two critical pieces of information:
>>> 
>>> 1. the location (i.e., node) of each process in my job so I can determine
>>> who shares a node with me. This is subsequently used by the shared memory
>>> subsystem for initialization and message routing; and
>>> 
>>> 2. BTL contact info for each process in my job.
>>> 
>>> During our recent efforts to further abstract the RTE from the MPI layer, we
>>> pushed responsibility for both pieces of information into the MPI layer.
>>> This wasn't done capriciously - the modex has always included the exchange
>>> of both pieces of information, and we chose not to disturb that situation.
>>> 
>>> However, the mixing of these two functional requirements does cause problems
>>> when dealing with an environment such as the Cray where BTL information is
>>> "exchanged" via an entirely different mechanism. In addition, it has been
>>> noted that the RTE (and not the MPI layer) actually "knows" the node
>>> location for each process.
>>> 
>>> Hence, questions have been raised as to whether:
>>> 
>>> (a) the modex should be built into a framework to allow multiple BTL
>>> exchange mechanisms to be supported, or some alternative mechanism be used -
>>> one suggestion made was to implement an MPICH-like attribute exchange; and
>>> 
>>> (b) the RTE should absorb responsibility for providing a "node map" of the
>>> processes in a job (note: the modex may -use- that info, but would no longer
>>> be required to exchange it). This has a number of implications that need to
>>> be carefully considered - e.g., the memory required to store the node map in
>>> every process is non-zero. On the other hand:
>>> 
>>> (i) every proc already -does- store the node for every proc - it is simply
>>> stored in the ompi_proc_t structures as opposed to somewhere in the RTE. We
>>> would want to avoid duplicating that storage, but there would be no change
>>> in memory footprint if done carefully.
>>> 
>>> (ii) every daemon already knows the node map for the job, so communicating
>>> that info to its local procs may not prove a major burden. However, the very
>>> environments where this subject may be an issue (e.g., the Cray) do not use
>>> our daemons, so some alternative mechanism for obtaining the info would be
>>> required.
>>> 
>>> 
>>> So the questions to be considered here are:
>>> 
>>> (a) do we leave the current modex "as-is", to include exchange of the node
>>> map, perhaps including "#if" statements to support different exchange
>>> mechanisms?
>>> 
>>> (b) do we separate the two functions currently in the modex and push the
>>> requirement to obtain a node map into the RTE? If so, how do we want the MPI

Re: [OMPI devel] RTE Issue IV: RTE/MPI relative modex responsibilities

2007-12-06 Thread Shipman, Galen M.
>> 
>> Do we really need a complete node map? As far as I can tell, it looks
>> like the MPI layer only needs a list of local processes. So maybe it
>> would be better to forget about the node ids at the mpi layer and just
>> return the local procs.
> 
> I agree, though I don't think we want a parallel list of procs. We just need
> to set the "local" flag in the existing ompi_proc_t structures.
> 

Having a parallel list of procs makes perfect sense. That way ORTE can store
ORTE information in the orte_proc_t and OMPI can store OMPI information in
the ompi_proc_t. The ompi_proc_t could either "inherit" the orte_proc_t or
have a pointer to it so that we have no duplication of data.
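
As a rough illustration of those two layouts, here is a minimal sketch in C.
The struct and field names are hypothetical and heavily simplified; they are
not the actual orte_proc_t or ompi_proc_t definitions. Either way the node id
is stored once, in the RTE-level record.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the RTE-level record (not the real orte_proc_t). */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
    uint32_t node_id;
} sketch_orte_proc_t;

/* Option 1: "inherit" by embedding the RTE record as the first member,
 * so a pointer to the MPI-level record also points at the RTE record. */
typedef struct {
    sketch_orte_proc_t super;
    bool is_local;              /* MPI-level information only */
} sketch_ompi_proc_embed_t;

/* Option 2: keep a pointer to an RTE record owned by the RTE. */
typedef struct {
    const sketch_orte_proc_t *rte;
    bool is_local;              /* MPI-level information only */
} sketch_ompi_proc_ref_t;

/* Both layouts read the node id from the single RTE-level copy. */
static uint32_t sketch_embed_node(const sketch_ompi_proc_embed_t *p)
{
    return p->super.node_id;
}

static uint32_t sketch_ref_node(const sketch_ompi_proc_ref_t *p)
{
    return p->rte->node_id;
}
```

The embedded form avoids an extra indirection but ties the two lifetimes
together; the pointer form lets the RTE own and share the record.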

Having a global map makes sense, particularly for numerous communication
scenarios, if I know all the processes are on the same node I may send a
message to the lowest "vpid" on that node and he could then forward to
everyone else.
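
To make the "lowest vpid on that node" selection concrete, here is a small
sketch in C. It assumes the global map is available as a flat array indexed by
vpid, which is purely an assumption for illustration; the function name and
sentinel are hypothetical. It only picks the forwarding target and says nothing
about how the message itself would be sent.

```c
#include <stddef.h>
#include <stdint.h>

/* Sentinel returned when no process lives on the requested node. */
#define SKETCH_NO_PROC UINT32_MAX

/* node_of[v] holds the node id hosting vpid v (assumed map layout).
 * Scanning upward from vpid 0 means the first match is the lowest vpid. */
static uint32_t sketch_lowest_vpid_on_node(const uint32_t *node_of,
                                           size_t nprocs, uint32_t node)
{
    for (size_t v = 0; v < nprocs; ++v) {
        if (node_of[v] == node) {
            return (uint32_t)v;
        }
    }
    return SKETCH_NO_PROC;
}
```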


> One option is for the RTE to just pass in an enviro variable with a
> comma-separated list of your local ranks, but that creates a problem down
> the road when trying to integrate tighter with systems like SLURM where the
> procs would get mass-launched (so the environment has to be the same for all
> of them).
> 
Having a enviro variable with at comma-seperated list of local ranks doesn't
seems like a bit of a hack to me.

>> 
>> So my vote would be to leave the modex alone, but remove the node id,
>> and add a function to get the list of local procs. It doesn't matter to
>> me how the RTE implements that.
> 
> I think we would need to be careful here that we don't create a need for
> more communication. We have two functions currently in the modex:
> 
> 1. how to exchange the info required to populate the ompi_proc_t structures;
> and
> 
> 2. how to identify which of those procs are "local"
> 
> The problem with leaving the modex as it currently sits is that some
> environments require a different mechanism for exchanging the ompi_proc_t
> info. While most can use the RML, some can't. The same division of
> capabilities applies to getting the "local" info, so it makes sense to me to
> put the modex in a framework.
> 
> Otherwise, we wind up with a bunch of #if's in the code to support
> environments like the Cray. I believe the mca system was put in place
> precisely to avoid those kinds of practices, so it makes sense to me to take
> advantage of it.
> 
> 
>> 
>> Alternatively, if we did a process attribute system we could just use
>> predefined attributes, and the runtime can get each process's node id
>> however it wants.
> 
> Same problem as above, isn't it? Probably ignorance on my part, but it seems
> to me that we simply exchange a modex framework for an attribute framework
> (since each environment would have to get the attribute values in a
> different manner) - don't we?
> 
> I have no problem with using attributes instead of the modex, but the issue
> appears to be the same either way - you still need a framework to handle the
> different methods.
> 
> 
> Ralph
> 
>> 
>> Tim
>> 
>> Ralph H Castain wrote:
>>> IV. RTE/MPI relative modex responsibilities
>>> The modex operation conducted during MPI_Init currently involves the
>>> exchange of two critical pieces of information:
>>> 
>>> 1. the location (i.e., node) of each process in my job so I can determine
>>> who shares a node with me. This is subsequently used by the shared memory
>>> subsystem for initialization and message routing; and
>>> 
>>> 2. BTL contact info for each process in my job.
>>> 
>>> During our recent efforts to further abstract the RTE from the MPI layer, we
>>> pushed responsibility for both pieces of information into the MPI layer.
>>> This wasn't done capriciously - the modex has always included the exchange
>>> of both pieces of information, and we chose not to disturb that situation.
>>> 
>>> However, the mixing of these two functional requirements does cause problems
>>> when dealing with an environment such as the Cray where BTL information is
>>> "exchanged" via an entirely different mechanism. In addition, it has been
>>> noted that the RTE (and not the MPI layer) actually "knows" the node
>>> location for each process.
>>> 
>>> Hence, questions have been raised as to whether:
>>> 
>>> (a) the modex should be built into a framework to allow multiple BTL
>>> exchange mechanisms to be supported, or some alternative mechanism be used -
>>> one suggestion made was to implement an MPICH-like attribute exchange; and
>>> 
>>> (b) the RTE should absorb responsibility for providing a "node map" of the
>>> processes in a job (note: the modex may -use- that info, but would no longer
>>> be required to exchange it). This has a number of implications that need to
>>> be carefully considered - e.g., the memory required to store the node map in
>>> every process is non-zero. On the other hand:
>>> 
>>> (i) every proc already -does- store the node for every proc - it is simply
>>> stored in the ompi_proc_t structures as 

Re: [OMPI devel] RTE Issue IV: RTE/MPI relative modex responsibilities

2007-12-05 Thread Ralph H Castain



On 12/5/07 8:48 AM, "Tim Prins" wrote:

> Well, I think it is pretty obvious that I am a fan of an attribute system :)
> 
> For completeness, I will point out that we also exchange architecture
> and hostname info in the modex.

True - except we should note that hostname info is only exchanged if someone
specifically requests it.

> 
> Do we really need a complete node map? As far as I can tell, it looks
> like the MPI layer only needs a list of local processes. So maybe it
> would be better to forget about the node ids at the mpi layer and just
> return the local procs.

I agree, though I don't think we want a parallel list of procs. We just need
to set the "local" flag in the existing ompi_proc_t structures.

One option is for the RTE to just pass in an enviro variable with a
comma-separated list of your local ranks, but that creates a problem down
the road when trying to integrate tighter with systems like SLURM where the
procs would get mass-launched (so the environment has to be the same for all
of them).
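
For reference, the process side of that option could look roughly like the
sketch below. It is only an illustration: the function name and the variable
name SKETCH_LOCAL_RANKS in the usage note are hypothetical, and no such
variable is defined anywhere today.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Parse a comma-separated list such as "0,1,5" into out[].
 * Returns the number of ranks parsed, 0 for a NULL or empty list,
 * or -1 if the list is malformed or holds more than max entries. */
static int sketch_parse_local_ranks(const char *list,
                                    unsigned long *out, size_t max)
{
    size_t n = 0;
    const char *p = list;

    if (list == NULL || *list == '\0') {
        return 0;
    }
    for (;;) {
        char *end;
        /* Require a digit so that signs and whitespace are rejected. */
        if (*p < '0' || *p > '9') {
            return -1;
        }
        errno = 0;
        unsigned long v = strtoul(p, &end, 10);
        if (end == p || errno != 0 || n == max) {
            return -1;
        }
        out[n++] = v;
        if (*end == '\0') {
            break;
        }
        if (*end != ',') {
            return -1;
        }
        p = end + 1;
    }
    return (int)n;
}
```

A caller would do something like
sketch_parse_local_ranks(getenv("SKETCH_LOCAL_RANKS"), ranks, 64). Note that
the list differs per node, which is exactly what conflicts with a mass launch
where every process gets the same environment.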

> 
> So my vote would be to leave the modex alone, but remove the node id,
> and add a function to get the list of local procs. It doesn't matter to
> me how the RTE implements that.

I think we would need to be careful here that we don't create a need for
more communication. We have two functions currently in the modex:

1. how to exchange the info required to populate the ompi_proc_t structures;
and

2. how to identify which of those procs are "local"

The problem with leaving the modex as it currently sits is that some
environments require a different mechanism for exchanging the ompi_proc_t
info. While most can use the RML, some can't. The same division of
capabilities applies to getting the "local" info, so it makes sense to me to
put the modex in a framework.

Otherwise, we wind up with a bunch of #if's in the code to support
environments like the Cray. I believe the mca system was put in place
precisely to avoid those kinds of practices, so it makes sense to me to take
advantage of it.
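
To make the contrast with #if blocks concrete, here is a deliberately
simplified sketch in C. It is not the MCA API; it is just a plain table of
function pointers with one entry per exchange mechanism, selected by name at
run time. The module names "rml" and "native", the struct, and the stub
functions are all hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical module interface covering the two modex functions:
 * exchanging proc info and identifying which procs are local. */
typedef struct {
    const char *name;
    int (*exchange_proc_info)(void);
    int (*identify_local)(void);
} sketch_modex_module_t;

/* Stubs standing in for real implementations; each returns 0 on success. */
static int sketch_rml_exchange(void)    { return 0; }
static int sketch_rml_local(void)       { return 0; }
static int sketch_native_exchange(void) { return 0; }
static int sketch_native_local(void)    { return 0; }

/* One table entry per environment replaces one #if branch per environment. */
static const sketch_modex_module_t sketch_modules[] = {
    { "rml",    sketch_rml_exchange,    sketch_rml_local },
    { "native", sketch_native_exchange, sketch_native_local },
};

/* Look up a module by name; returns NULL if none matches. */
static const sketch_modex_module_t *sketch_modex_select(const char *name)
{
    size_t n = sizeof(sketch_modules) / sizeof(sketch_modules[0]);
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(sketch_modules[i].name, name) == 0) {
            return &sketch_modules[i];
        }
    }
    return NULL;
}
```

The calling code then stays identical across environments: it selects a module
once and calls through the two pointers.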


> 
> Alternatively, if we did a process attribute system we could just use
> predefined attributes, and the runtime can get each process's node id
> however it wants.

Same problem as above, isn't it? Probably ignorance on my part, but it seems
to me that we simply exchange a modex framework for an attribute framework
(since each environment would have to get the attribute values in a
different manner) - don't we?

I have no problem with using attributes instead of the modex, but the issue
appears to be the same either way - you still need a framework to handle the
different methods.


Ralph

> 
> Tim
> 
> Ralph H Castain wrote:
>> IV. RTE/MPI relative modex responsibilities
>> The modex operation conducted during MPI_Init currently involves the
>> exchange of two critical pieces of information:
>> 
>> 1. the location (i.e., node) of each process in my job so I can determine
>> who shares a node with me. This is subsequently used by the shared memory
>> subsystem for initialization and message routing; and
>> 
>> 2. BTL contact info for each process in my job.
>> 
>> During our recent efforts to further abstract the RTE from the MPI layer, we
>> pushed responsibility for both pieces of information into the MPI layer.
>> This wasn't done capriciously - the modex has always included the exchange
>> of both pieces of information, and we chose not to disturb that situation.
>> 
>> However, the mixing of these two functional requirements does cause problems
>> when dealing with an environment such as the Cray where BTL information is
>> "exchanged" via an entirely different mechanism. In addition, it has been
>> noted that the RTE (and not the MPI layer) actually "knows" the node
>> location for each process.
>> 
>> Hence, questions have been raised as to whether:
>> 
>> (a) the modex should be built into a framework to allow multiple BTL
>> exchange mechanisms to be supported, or some alternative mechanism be used -
>> one suggestion made was to implement an MPICH-like attribute exchange; and
>> 
>> (b) the RTE should absorb responsibility for providing a "node map" of the
>> processes in a job (note: the modex may -use- that info, but would no longer
>> be required to exchange it). This has a number of implications that need to
>> be carefully considered - e.g., the memory required to store the node map in
>> every process is non-zero. On the other hand:
>> 
>> (i) every proc already -does- store the node for every proc - it is simply
>> stored in the ompi_proc_t structures as opposed to somewhere in the RTE. We
>> would want to avoid duplicating that storage, but there would be no change
>> in memory footprint if done carefully.
>> 
>> (ii) every daemon already knows the node map for the job, so communicating
>> that info to its local procs may not prove a major burden. However, the very
>> environments where this subject may be an issue (e.g., the Cray) do not use
>> our daemons, so 

Re: [OMPI devel] RTE Issue IV: RTE/MPI relative modex responsibilities

2007-12-05 Thread Tim Prins

Well, I think it is pretty obvious that I am a fan of an attribute system :)

For completeness, I will point out that we also exchange architecture 
and hostname info in the modex.


Do we really need a complete node map? As far as I can tell, it looks
like the MPI layer only needs a list of local processes. So maybe it 
would be better to forget about the node ids at the mpi layer and just 
return the local procs.


So my vote would be to leave the modex alone, but remove the node id, 
and add a function to get the list of local procs. It doesn't matter to 
me how the RTE implements that.
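
For illustration, one possible shape for such a function is sketched below in
C. It assumes the RTE holds a flat node map indexed by vpid, which is only one
of several ways it could be implemented; the function name and signature are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Collect the vpids that share a node with my_vpid (including my_vpid).
 * node_of[v] is the node id hosting vpid v (assumed map layout).
 * Precondition: my_vpid < nprocs.
 * Writes at most max entries to out and returns the total number of
 * local procs, so a caller can detect a too-small buffer. */
static size_t sketch_get_local_procs(const uint32_t *node_of, size_t nprocs,
                                     uint32_t my_vpid,
                                     uint32_t *out, size_t max)
{
    size_t count = 0;
    uint32_t my_node = node_of[my_vpid];

    for (size_t v = 0; v < nprocs; ++v) {
        if (node_of[v] == my_node) {
            if (count < max) {
                out[count] = (uint32_t)v;
            }
            ++count;
        }
    }
    return count;
}
```

With this interface the MPI layer never sees a node id; it only receives the
list of local vpids.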


Alternatively, if we did a process attribute system we could just use 
predefined attributes, and the runtime can get each process's node id 
however it wants.


Tim

Ralph H Castain wrote:

IV. RTE/MPI relative modex responsibilities
The modex operation conducted during MPI_Init currently involves the
exchange of two critical pieces of information:

1. the location (i.e., node) of each process in my job so I can determine
who shares a node with me. This is subsequently used by the shared memory
subsystem for initialization and message routing; and

2. BTL contact info for each process in my job.

During our recent efforts to further abstract the RTE from the MPI layer, we
pushed responsibility for both pieces of information into the MPI layer.
This wasn't done capriciously - the modex has always included the exchange
of both pieces of information, and we chose not to disturb that situation.

However, the mixing of these two functional requirements does cause problems
when dealing with an environment such as the Cray where BTL information is
"exchanged" via an entirely different mechanism. In addition, it has been
noted that the RTE (and not the MPI layer) actually "knows" the node
location for each process.

Hence, questions have been raised as to whether:

(a) the modex should be built into a framework to allow multiple BTL
exchange mechanisms to be supported, or some alternative mechanism be used -
one suggestion made was to implement an MPICH-like attribute exchange; and

(b) the RTE should absorb responsibility for providing a "node map" of the
processes in a job (note: the modex may -use- that info, but would no longer
be required to exchange it). This has a number of implications that need to
be carefully considered - e.g., the memory required to store the node map in
every process is non-zero. On the other hand:

(i) every proc already -does- store the node for every proc - it is simply
stored in the ompi_proc_t structures as opposed to somewhere in the RTE. We
would want to avoid duplicating that storage, but there would be no change
in memory footprint if done carefully.

(ii) every daemon already knows the node map for the job, so communicating
that info to its local procs may not prove a major burden. However, the very
environments where this subject may be an issue (e.g., the Cray) do not use
our daemons, so some alternative mechanism for obtaining the info would be
required.


So the questions to be considered here are:

(a) do we leave the current modex "as-is", to include exchange of the node
map, perhaps including "#if" statements to support different exchange
mechanisms?

(b) do we separate the two functions currently in the modex and push the
requirement to obtain a node map into the RTE? If so, how do we want the MPI
layer to retrieve that info so we avoid increasing our memory footprint?

(c) do we create a separate modex framework for handling the different
exchange mechanisms for BTL info, do we incorporate it into an existing one
(if so, which one), the new publish-subscribe framework, implement an
alternative approach, or...?

(d) other suggestions?

Ralph





[OMPI devel] RTE Issue IV: RTE/MPI relative modex responsibilities

2007-12-04 Thread Ralph H Castain
IV. RTE/MPI relative modex responsibilities
The modex operation conducted during MPI_Init currently involves the
exchange of two critical pieces of information:

1. the location (i.e., node) of each process in my job so I can determine
who shares a node with me. This is subsequently used by the shared memory
subsystem for initialization and message routing; and

2. BTL contact info for each process in my job.

During our recent efforts to further abstract the RTE from the MPI layer, we
pushed responsibility for both pieces of information into the MPI layer.
This wasn't done capriciously - the modex has always included the exchange
of both pieces of information, and we chose not to disturb that situation.

However, the mixing of these two functional requirements does cause problems
when dealing with an environment such as the Cray where BTL information is
"exchanged" via an entirely different mechanism. In addition, it has been
noted that the RTE (and not the MPI layer) actually "knows" the node
location for each process.

Hence, questions have been raised as to whether:

(a) the modex should be built into a framework to allow multiple BTL
exchange mechanisms to be supported, or some alternative mechanism be used -
one suggestion made was to implement an MPICH-like attribute exchange; and

(b) the RTE should absorb responsibility for providing a "node map" of the
processes in a job (note: the modex may -use- that info, but would no longer
be required to exchange it). This has a number of implications that need to
be carefully considered - e.g., the memory required to store the node map in
every process is non-zero. On the other hand:

(i) every proc already -does- store the node for every proc - it is simply
stored in the ompi_proc_t structures as opposed to somewhere in the RTE. We
would want to avoid duplicating that storage, but there would be no change
in memory footprint if done carefully.

(ii) every daemon already knows the node map for the job, so communicating
that info to its local procs may not prove a major burden. However, the very
environments where this subject may be an issue (e.g., the Cray) do not use
our daemons, so some alternative mechanism for obtaining the info would be
required.


So the questions to be considered here are:

(a) do we leave the current modex "as-is", to include exchange of the node
map, perhaps including "#if" statements to support different exchange
mechanisms?

(b) do we separate the two functions currently in the modex and push the
requirement to obtain a node map into the RTE? If so, how do we want the MPI
layer to retrieve that info so we avoid increasing our memory footprint?

(c) do we create a separate modex framework for handling the different
exchange mechanisms for BTL info, do we incorporate it into an existing one
(if so, which one), the new publish-subscribe framework, implement an
alternative approach, or...?

(d) other suggestions?

Ralph