Re: [OMPI devel] RFC: changes to modex

2008-04-15 Thread Tim Prins
Hate to bring this up again, but I was thinking that an easy way to 
reduce the size of the modex would be to reduce the length of the names 
describing each piece of data.


More concretely, for a simple run I get the following names, each of 
which is sent over the wire for every proc (note that this will change 
depending on the number of btls one has active):

ompi-proc-info
btl.openib.1.3
btl.tcp.1.3
pml.base.1.0
btl.udapl.1.3

So that's 89 bytes of naming overhead (size of strings + dss packing 
overhead) per process.


A couple of possible solutions to this:
1. Send 32-bit string hashes instead of the strings. This would reduce 
the per-process size from 89 to 20 bytes, but there is always a (slight) 
possibility of collisions.


2. Change the way the dss packs strings. Currently, it packs a 32-bit 
string length, the string, and a null terminator. It may be good enough 
to just pack the string and the NULL terminator. This would reduce the 
per-process size from 89 to 69 bytes.


3. Reduce the length of the names. 'ompi-proc-info' could become simply 
'pinf', and two of the separators could be removed in the other names 
(ex: 'btl.openib.1.3' -> 'btlopenib1.3'). This would change the 
per-process size from 89 to 71 bytes.


4. Do #2 & #3. This would change the per-process size from 89 to 51 bytes.

Anyways, just an idea for consideration.

Tim


WHAT: Changes to MPI layer modex API

WHY: To be mo' betta scalable

WHERE: ompi/mpi/runtime/ompi_module_exchange.* and everywhere that
calls ompi_modex_send() and/or ompi_modex_recv()

TIMEOUT: COB Fri 4 Apr 2008

DESCRIPTION:

Per some of the scalability discussions that have been occurring (some
on-list and some off-list), and per the e-mail I sent out last week
about ongoing work in the openib BTL, Ralph and I put together a loose
proposal this morning to make the modex more scalable. The timeout is
fairly short because Ralph wanted to start implementing in the near
future, and we didn't anticipate that this would be a contentious
proposal.

The theme is to break the modex into two different kinds of data:

- Modex data that is specific to a given proc
- Modex data that is applicable to all procs on a given node

For example, in the openib BTL, the majority of modex data is
applicable to all processes on the same node (GIDs and LIDs and
whatnot). It is much more efficient to send only one copy of such
node-specific data to each process (vs. sending ppn copies to each
process). The spreadsheet I included in last week's e-mail clearly
shows this.

1. Add new modex API functions. The exact function signatures are
TBD, but they will be generally of the form:

  * int ompi_modex_proc_send(...): send modex data that is specific to
this process. It is just about exactly the same as the current API
call (ompi_modex_send).

  * int ompi_modex_proc_recv(...): receive modex data from a specified
peer process (indexed on ompi_proc_t*). It is just about exactly the
same as the current API call (ompi_modex_recv).

  * int ompi_modex_node_send(...): send modex data that is relevant
for all processes in this job on this node. It is intended that only
one process in a job on a node will call this function. If more than
one process in a job on a node calls _node_send(), then only one will
"win" (meaning that the data sent by the others will be overwritten).

  * int ompi_modex_node_recv(...): receive modex data that is relevant
for a whole peer node; receive the ["winning"] blob sent by
_node_send() from the source node. We haven't yet decided what the
node index will be; it may be (ompi_proc_t*) (i.e., _node_recv() would
figure out what node the (ompi_proc_t*) resides on and then give you
the data).

2. Make the existing modex API calls (ompi_modex_send,
ompi_modex_recv) be wrappers around the new "proc" send/receive
calls. This will provide exactly the same functionality as the
current API (but be sub-optimal at scale). It will give BTL authors
(etc.) time to update to the new API, potentially taking advantage of
common data across multiple processes on the same node. We'll likely
put in some opal_output()'s in the wrappers to help identify code that
is still calling the old APIs.

3. Remove the old API calls (ompi_modex_send, ompi_modex_recv) before
v1.3 is released.






Re: [OMPI devel] RFC: changes to modex

2008-04-03 Thread Jeff Squyres

On Apr 3, 2008, at 11:16 AM, Jeff Squyres wrote:


The size of the openib modex is explained in btl_openib_component.c in
the branch.  It's a packed message now; we don't just blindly copy an
entire struct.  Here's the comment:

/* The message is packed into multiple parts:
 * 1. a uint8_t indicating the number of modules (ports) in the message
 * 2. for each module:
 *    a. the common module data
 *    b. a uint8_t indicating how many CPCs follow
 *    c. for each CPC:
 *       a. a uint8_t indicating the index of the CPC in the all[]
 *          array in btl_openib_connect_base.c
 *       b. a uint8_t indicating the priority of this CPC
 *       c. a uint8_t indicating the length of the blob to follow
 *       d. a blob that is only meaningful to that CPC
 */

The common module data is what I sent in the other message.



Gaa.. I forgot to finish explaining the spreadsheet before I sent  
this; sorry...


The 4 lines of oob/xoob/ibcm/rdmacm cpc sizes are how many bytes those  
cpc's contribute (on a per-port basis) to the modex.  "size 1" is what  
they currently contribute.  "size 2" is if Jon and I are able to shave  
off a few more bytes (not entirely sure that's possible yet).


The "machine 1" and "machine 2" sections are three configurations each  
of two sample machines.


The first block of numbers is how big the openib part of the modex is  
when only using the ibcm cpc, when only using the rdmacm cpc, and when  
using both the ibcm and rdmacm cpc's (i.e., both are sent in the  
modex; one will "win" and be used at run-time).  The overall number is  
a result of plugging in the numbers from the machine parameters  
(nodes, ppn, num ports) and the ibcm/rdmacm cpc sizes to the formula  
at the top of the spreadsheet.


The second block of numbers is the result of modifying the formula at  
the top of the spreadsheet to calculate basically sending the per-port  
information only once (this modified formula did not include sending a  
per-port bitmap, as came up later in the thread).  The green numbers in  
that block are the differences between these numbers and the first block.


The third block of numbers is the same as the second block, but using  
the "size 2" cpc sizes.  The green numbers are the differences between  
these numbers and the first block; the blue numbers are the  
differences between these numbers and the second block.


-

Note: based on what came up later in the thread (e.g., not taking into  
account carto and whatnot), the 2nd and 3rd blocks of numbers are not  
entirely accurate.  But they're likely still in the right ballpark.  
My point was that the size differences between the 1st block and the  
2nd/3rd blocks seemed to be significant enough to warrant moving ahead  
with a "reduce replication in the modex" scheme.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] RFC: changes to modex

2008-04-03 Thread Jeff Squyres

On Apr 3, 2008, at 8:52 AM, Gleb Natapov wrote:

It'll increase it compared to the optimization that we're about to
make.  But it will certainly be a large decrease compared to what
we're doing today


Maybe I don't understand something in what you propose, then. Currently,
when I run two procs on the same node and each proc uses a different HCA,
each one of them sends a message that describes the HCA in use by the
proc. The message is of the form .
Each proc sends one of those, so there are two messages total on the wire.

You propose that one of them should send a description of both
available ports (that is, one of them sends two messages of the form
above) and then each proc sends an additional message with the index of
the HCA that it is going to use. And this is more data on the wire after
the proposed optimization than we have now.


I guess what I'm trying to address is optimizing the common case.   
What I perceive the common case to be is:


- high PPN values (4, 8, 16, ...)
- PPN is larger than the number of verbs-capable ports
- homogeneous openfabrics network

Yes, you will definitely find other cases.  But I'd guess that this  
is, by far, the most common case (especially at scale).  I don't want  
to penalize the common case for the sake of some one-off installations.


I'm basing this optimization on the assumption that PPNs will be  
larger than the number of available ports, so there is guaranteed to  
be duplication in the modex message.  Removing that duplication is the  
main goal of this optimization.



 (see the spreadsheet that I sent last week).

I've looked at it but I could not decipher it :( I don't understand
where all these numbers come from.


Why didn't you ask?  :-)

The size of the openib modex is explained in btl_openib_component.c in  
the branch.  It's a packed message now; we don't just blindly copy an  
entire struct.  Here's the comment:


/* The message is packed into multiple parts:
 * 1. a uint8_t indicating the number of modules (ports) in the message
 * 2. for each module:
 *    a. the common module data
 *    b. a uint8_t indicating how many CPCs follow
 *    c. for each CPC:
 *       a. a uint8_t indicating the index of the CPC in the all[]
 *          array in btl_openib_connect_base.c
 *       b. a uint8_t indicating the priority of this CPC
 *       c. a uint8_t indicating the length of the blob to follow
 *       d. a blob that is only meaningful to that CPC
 */

The common module data is what I sent in the other message.


I guess I don't see the problem...?

I like things to be simple. KISS principle I guess.


I can see your point that this is getting fairly complicated.  :-\   
See below.



And I do care about
heterogeneous include/exclude too.


How much?  I still think we can support it just fine; I just want to  
make [what I perceive to be] the common case better.


I looked at what kind of data we send during the openib modex and I
created a file with 1 openib modex messages. The mtu, subnet id, and
cpc list were the same in each message but lid/apm_lid were different;
this is a pretty close approximation of the data that is sent from the
HN to each process. The uncompressed file size is 489K; the compressed
file size is 43K.

More than 10 times smaller.



Was this the full modex message, or just the openib part?

Those are promising sizes (43k), though; how long does it take to  
compress/uncompress this data in memory?  That also must be factored  
into the overall time.


How about a revised and combined proposal:

- openib: Use a simplified "send all ACTIVE ports" per-host message  
(i.e., before include/exclude and carto are applied)
- openib: Send a small bitmap for each proc indicating which ports  
each btl module will use
- modex: Compress the result (probably only if it's larger than some  
threshold size?) when sending, decompress upon receive


This keeps it simple -- no special cases for heterogeneous  
include/exclude, etc.  And if compression is cheap (can you do some  
experiments to find out?), perhaps we can link against libz (I see the  
libz in at least RHEL4 is BSD licensed, so there's no issue there) and  
de/compress in memory.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] RFC: changes to modex

2008-04-03 Thread Jeff Squyres

On Apr 3, 2008, at 9:18 AM, Gleb Natapov wrote:

I am talking about openib part of the modex. The "garbage" I am
referring to is this:


FWIW, on the openib-cpc2 branch, the base data that is sent in the  
modex is this:


uint64_t subnet_id;
/** LID of this port */
uint16_t lid;
/** APM LID for this port */
uint16_t apm_lid;
/** The MTU used by this port */
uint8_t mtu;

lid is used by both the xoob and ibcm cpc's.  We can skip packing the  
apm_lid if APM support is not used, if you really want to.  The MTU has  
been changed to the 8-bit enum value.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] RFC: changes to modex

2008-04-03 Thread Ralph H Castain
Hmmm...since I have no control nor involvement in what gets sent, perhaps I
can be a disinterested third party. ;-)

Could you perhaps explain this comment:

> BTW I looked at how we do modex now on the trunk. For OOB case more
> than half the data we send for each proc is garbage.


What "garbage" are you referring to? I am working to remove the stuff
inserted by proc.c - mostly hostname, hopefully arch, etc. If you are
running a "debug" version, there will also be type descriptors for each
entry, but those are eliminated for optimized builds.

So are you referring to other things?

Thanks
Ralph


On 4/3/08 6:52 AM, "Gleb Natapov"  wrote:

> On Wed, Apr 02, 2008 at 08:41:14PM -0400, Jeff Squyres wrote:
>>>> that it's the same for all procs on all hosts.  I guess there's a few
>>>> cases:
>>>>
>>>> 1. homogeneous include/exclude, no carto: send all in node info; no
>>>> proc info
>>>> 2. homogeneous include/exclude, carto is used: send all ports in node
>>>> info; send index in proc info for which node info port index it
>>>> will use
>>> This may actually increase modex size. Think about two procs using two
>>> different hcas. We'll send all the data we send today + indexes.
>> 
>> It'll increase it compared to the optimization that we're about to
>> make.  But it will certainly be a large decrease compared to what
>> we're doing today
> 
> Maybe I don't understand something in what you propose, then. Currently,
> when I run two procs on the same node and each proc uses a different HCA,
> each one of them sends a message that describes the HCA in use by the
> proc. The message is of the form .
> Each proc sends one of those, so there are two messages total on the wire.
> You propose that one of them should send a description of both
> available ports (that is, one of them sends two messages of the form
> above) and then each proc sends an additional message with the index of
> the HCA that it is going to use. And this is more data on the wire after
> the proposed optimization than we have now.
> 
> 
>>   (see the spreadsheet that I sent last week).
> I've looked at it but I could not decipher it :( I don't understand
> where all these numbers come from.
> 
>> 
>> Indeed, we can even put in the optimization that if there's only one
>> process on a host, it can only publish the ports that it will use (and
>> therefore there's no need for the proc data).
> More special cases :(
> 
>> 
>>>> 3. heterogeneous include/exclude, no carto: need user to tell us that
>>>> this situation exists (e.g., use another MCA param), but then is same
>>>> as #2
>>>> 4. heterogeneous include/exclude, carto is used, same as #3
>>>>
>>>> Right?
>>>>
>>> Looks like it. FWIW I don't like the idea to code all those special
>>> cases. The way it works now I can be pretty sure that any crazy setup
>>> I'll come up with will work.
>> 
>> And so it will with the new scheme.  The only place it won't work is
>> if the user specifies a heterogeneous include/exclude (i.e., we'll
>> require that the user tells us when they do that), which nobody does.
>> 
>> I guess I don't see the problem...?
> I like things to be simple. KISS principle I guess. And I do care about
> heterogeneous include/exclude too.
> 
> BTW I looked at how we do modex now on the trunk. For OOB case more
> than half the data we send for each proc is garbage.
> 
>> 
>>> By the way how much data are moved during modex stage? What if modex
>>> will use compression?
>> 
>> 
>> The spreadsheet I listed was just the openib part of the modex, and it
>> was fairly hefty.  I have no idea how well (or not) it would compress.
>> 
> I looked at what kind of data we send during the openib modex and I
> created a file with 1 openib modex messages. The mtu, subnet id, and cpc
> list were the same in each message but lid/apm_lid were different; this is
> a pretty close approximation of the data that is sent from the HN to each
> process. The uncompressed file size is 489K; the compressed file size is 43K.
> More than 10 times smaller.
> 
> --
> Gleb.
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] RFC: changes to modex

2008-04-02 Thread Jeff Squyres

On Apr 2, 2008, at 1:58 PM, Gleb Natapov wrote:

No, I think it would be fine to only send the output after
btl_openib_if_in|exclude is applied.  Perhaps we need an MCA param to
say "always send everything" in the case that someone applies a
non-homogeneous if_in|exclude set of values...?

When is carto stuff applied?  Is that what you're really asking  
about?



There is no difference between carto and include/exclude.


You mean in terms of when they are applied?


I can specify
different openib_if_include values for different procs on the same  
host.



I know you *can*, but it is certainly uncommon.  The common case is  
that it's the same for all procs on all hosts.  I guess there's a few  
cases:


1. homogeneous include/exclude, no carto: send all in node info; no  
proc info
2. homogeneous include/exclude, carto is used: send all ports in node  
info; send index in proc info for which node info port index it will use
3. heterogeneous include/exclude, no carto: need user to tell us that  
this situation exists (e.g., use another MCA param), but then is same  
as #2

4. heterogeneous include/exclude, carto is used, same as #3

Right?

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] RFC: changes to modex

2008-04-02 Thread Gleb Natapov
On Wed, Apr 02, 2008 at 12:08:47PM -0400, Jeff Squyres wrote:
> On Apr 2, 2008, at 11:13 AM, Gleb Natapov wrote:
> > On Wed, Apr 02, 2008 at 10:35:03AM -0400, Jeff Squyres wrote:
> >> If we use carto to limit which hcas/ports are used on a given host on a
> >> per-proc basis, then we can include some proc_send data to say "this proc
> >> only uses indexes X,Y,Z from the node data".  The indexes can be
> >> either uint8_ts, or maybe even a variable length bitmap.
> >>
> > So you propose that each proc will send info (using node_send())  
> > about every
> > hca/proc on a host even about those that are excluded from use by  
> > the proc
> > just in case? And then each proc will have to send additional info  
> > (using
> > proc_send() this time) to indicate what hcas/ports it is actually  
> > using?
> 
> 
> No, I think it would be fine to only send the output after  
> btl_openib_if_in|exclude is applied.  Perhaps we need an MCA param to  
> say "always send everything" in the case that someone applies a
> non-homogeneous if_in|exclude set of values...?
> 
> When is carto stuff applied?  Is that what you're really asking about?
> 
There is no difference between carto and include/exclude. I can specify
different openib_if_include values for different procs on the same host.

--
Gleb.


Re: [OMPI devel] RFC: changes to modex

2008-04-02 Thread Jeff Squyres

On Apr 2, 2008, at 11:13 AM, Gleb Natapov wrote:

On Wed, Apr 02, 2008 at 10:35:03AM -0400, Jeff Squyres wrote:
If we use carto to limit which hcas/ports are used on a given host on a
per-proc basis, then we can include some proc_send data to say "this proc
only uses indexes X,Y,Z from the node data".  The indexes can be
either uint8_ts, or maybe even a variable length bitmap.

So you propose that each proc will send info (using node_send()) about
every hca/port on a host, even about those that are excluded from use
by the proc, just in case? And then each proc will have to send
additional info (using proc_send() this time) to indicate what
hcas/ports it is actually using?



No, I think it would be fine to only send the output after  
btl_openib_if_in|exclude is applied.  Perhaps we need an MCA param to  
say "always send everything" in the case that someone applies a  
non-homogeneous if_in|exclude set of values...?


When is carto stuff applied?  Is that what you're really asking about?

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] RFC: changes to modex

2008-04-02 Thread Ralph H Castain



On 4/2/08 8:52 AM, "Terry Dontje"  wrote:

> Jeff Squyres wrote:
>> WHAT: Changes to MPI layer modex API
>> 
>> WHY: To be mo' betta scalable
>> 
>> WHERE: ompi/mpi/runtime/ompi_module_exchange.* and everywhere that
>> calls ompi_modex_send() and/or ompi_modex_recv()
>> 
>> TIMEOUT: COB Fri 4 Apr 2008
>> 
>> DESCRIPTION:
>> 
>>   
> [...snip...]
>>   * int ompi_modex_node_send(...): send modex data that is relevant
>> for all processes in this job on this node.  It is intended that only
>> one process in a job on a node will call this function.  If more than
>> one process in a job on a node calls _node_send(), then only one will
>> "win" (meaning that the data sent by the others will be overwritten).
>> 
>>   
>>   * int ompi_modex_node_recv(...): receive modex data that is relevant
>> for a whole peer node; receive the ["winning"] blob sent by
>> _node_send() from the source node.  We haven't yet decided what the
>> node index will be; it may be (ompi_proc_t*) (i.e., _node_recv() would
>> figure out what node the (ompi_proc_t*) resides on and then give you
>> the data).
>> 
>>   
> The above sounds like there could be race conditions if more than one
> process on a node is doing ompi_modex_node_send.  That is, are you
> really going to be able to be assured, when ompi_modex_node_recv is
> done, that one of the processes is not in the middle of doing
> ompi_modex_node_send?  I assume there must be some sort of gate that
> allows you to make sure no one is in the middle of overwriting your data.

The nature of the modex actually precludes this. The modex is implemented as
a barrier, so the timing actually looks like this:

1. each proc registers its modex_node[proc]_send calls early in MPI_Init.
All this does is collect the data locally in a buffer

2. each proc hits the orte_grpcomm.modex call in MPI_Init. At this point,
the collected data is sent to the local daemon. The proc "barriers" at this
point and can go no further until the modex is completed.

3. when the daemon detects that all local procs have sent it a modex buffer,
it enters an "allgather" operation across all daemons. When that operation
completes, each daemon has a complete modex buffer spanning the job.

4. each daemon "drops" the collected buffer into each local proc

5. each proc, upon receiving the modex buffer, decodes it and sets up its
data structs to respond to future modex_recv calls. Once that is completed,
the proc returns from the orte_grpcomm.modex call and is released from the
"barrier".


So we resolve the race condition by including a "barrier" inside the modex.
This is the current behavior as well - so this represents no change, just a
different organization of the modex'd data.

> 
> --td
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] RFC: changes to modex

2008-04-02 Thread Gleb Natapov
On Wed, Apr 02, 2008 at 10:35:03AM -0400, Jeff Squyres wrote:
> If we use carto to limit which hcas/ports are used on a given host on a
> per-proc basis, then we can include some proc_send data to say "this proc
> only uses indexes X,Y,Z from the node data".  The indexes can be  
> either uint8_ts, or maybe even a variable length bitmap.
> 
So you propose that each proc will send info (using node_send()) about every
hca/port on a host, even about those that are excluded from use by the proc,
just in case? And then each proc will have to send additional info (using
proc_send() this time) to indicate what hcas/ports it is actually using?

--
Gleb.


Re: [OMPI devel] RFC: changes to modex

2008-04-02 Thread Tim Prins
Is there a reason to rename ompi_modex_{send,recv} to 
ompi_modex_proc_{send,recv}? It seems simpler (and no more confusing and 
less work) to leave the names alone and add ompi_modex_node_{send,recv}.


Another question: Does the receiving process care that the information 
received applies to a whole node? I ask because maybe we could get the 
same effect by simply adding a parameter to ompi_modex_send which 
specifies if the data applies to just the proc or a whole node.


So, if we have ranks 1 & 2 on n1, and rank 3 on n2, then rank 1 would do:
ompi_modex_send("arch", arch, );
then rank 3 would do:
ompi_modex_recv(rank 1, "arch");
ompi_modex_recv(rank 2, "arch");

I don't really care either way, just wanted to throw out the idea.

Tim

Jeff Squyres wrote:

WHAT: Changes to MPI layer modex API

WHY: To be mo' betta scalable

WHERE: ompi/mpi/runtime/ompi_module_exchange.* and everywhere that  
calls ompi_modex_send() and/or ompi_modex_recv()


TIMEOUT: COB Fri 4 Apr 2008

DESCRIPTION:

Per some of the scalability discussions that have been occurring (some  
on-list and some off-list), and per the e-mail I sent out last week  
about ongoing work in the openib BTL, Ralph and I put together a loose  
proposal this morning to make the modex more scalable.  The timeout is  
fairly short because Ralph wanted to start implementing in the near  
future, and we didn't anticipate that this would be a contentious  
proposal.


The theme is to break the modex into two different kinds of data:

- Modex data that is specific to a given proc
- Modex data that is applicable to all procs on a given node

For example, in the openib BTL, the majority of modex data is  
applicable to all processes on the same node (GIDs and LIDs and  
whatnot).  It is much more efficient to send only one copy of such  
node-specific data to each process (vs. sending ppn copies to each  
process).  The spreadsheet I included in last week's e-mail clearly  
shows this.


1. Add new modex API functions.  The exact function signatures are  
TBD, but they will be generally of the form:


  * int ompi_modex_proc_send(...): send modex data that is specific to  
this process.  It is just about exactly the same as the current API  
call (ompi_modex_send).


  * int ompi_modex_proc_recv(...): receive modex data from a specified  
peer process (indexed on ompi_proc_t*).  It is just about exactly the  
same as the current API call (ompi_modex_recv).


  * int ompi_modex_node_send(...): send modex data that is relevant  
for all processes in this job on this node.  It is intended that only  
one process in a job on a node will call this function.  If more than  
one process in a job on a node calls _node_send(), then only one will  
"win" (meaning that the data sent by the others will be overwritten).


  * int ompi_modex_node_recv(...): receive modex data that is relevant  
for a whole peer node; receive the ["winning"] blob sent by  
_node_send() from the source node.  We haven't yet decided what the  
node index will be; it may be (ompi_proc_t*) (i.e., _node_recv() would  
figure out what node the (ompi_proc_t*) resides on and then give you  
the data).


2. Make the existing modex API calls (ompi_modex_send,  
ompi_modex_recv) be wrappers around the new "proc" send/receive  
calls.  This will provide exactly the same functionality as the  
current API (but be sub-optimal at scale).  It will give BTL authors  
(etc.) time to update to the new API, potentially taking advantage of  
common data across multiple processes on the same node.  We'll likely  
put in some opal_output()'s in the wrappers to help identify code that  
is still calling the old APIs.


3. Remove the old API calls (ompi_modex_send, ompi_modex_recv) before  
v1.3 is released.






Re: [OMPI devel] RFC: changes to modex

2008-04-02 Thread Jeff Squyres

On Apr 2, 2008, at 10:27 AM, Gleb Natapov wrote:
In the case of the openib BTL, what part of the modex are you going to
send using proc_send() and what part using node_send()?



In the /tmp-public/openib-cpc2 branch, almost all of it will go to the  
node_send().  The CPC's will likely now get 2 buffers: one for  
node_send, and one for proc_send.


The ibcm CPC, for example, can do everything in node_send (the  
service_id that I use in the ibcm calls is the proc's PID; ORTE may  
supply peer PIDs directly -- haven't decided if that's a good idea yet  
or not -- if it doesn't, the PID can be sent in the proc_send data).   
The rdmacm CPC may need a proc_send for the listening TCP port number;  
still need to figure that one out.


If we use carto to limit which hcas/ports are used on a given host on a
per-proc basis, then we can include some proc_send data to say "this proc
only uses indexes X,Y,Z from the node data".  The indexes can be
either uint8_ts, or maybe even a variable length bitmap.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] RFC: changes to modex

2008-04-02 Thread Gleb Natapov
On Wed, Apr 02, 2008 at 10:21:12AM -0400, Jeff Squyres wrote:
>   * int ompi_modex_proc_send(...): send modex data that is specific to  
> this process.  It is just about exactly the same as the current API  
> call (ompi_modex_send).
> 
[skip]
> 
>   * int ompi_modex_node_send(...): send modex data that is relevant  
> for all processes in this job on this node.  It is intended that only  
> one process in a job on a node will call this function.  If more than  
> one process in a job on a node calls _node_send(), then only one will  
> "win" (meaning that the data sent by the others will be overwritten).
> 
In the case of the openib BTL, what part of the modex are you going to send
using proc_send() and what part using node_send()?

--
Gleb.