Re: [OMPI devel] [PATCH] openib btl: extensible cpc selection enablement

2008-01-14 Thread Pavel Shamis (Pasha)

Jon Mason wrote:
  

I have a few machines with ConnectX and I will try to run MTT on Sunday.



Awesome!  I appreciate it.

  
After fixing the compilation problem in the XRC part of the code I was
able to run MTT.  Most of the tests pass, but one test failed:
mpi2c++_dynamics_test.  The test passes without XRC.  However, I also
see the test fail on trunk; the last version where it works is
1.3a1r17085.

Strange.
Pasha.


Re: [OMPI devel] [PATCH] openib btl: extensible cpc selection enablement

2008-01-11 Thread Jeff Squyres

On Jan 10, 2008, at 4:17 AM, Pavel Shamis (Pasha) wrote:

This patch has been tested with IB and iWARP adapters on a 2 node system
(with it correctly choosing to use oob and happily ignoring iWARP
adapters).  It needs XRC testing and testing of larger node systems.


Did you run MTT over all these changes?
I have a few machines with ConnectX and I will try to run MTT on Sunday.



I just completed a rather large MTT run over all the same tests and
variants that I run in MTT on the nightly tarballs every night.  All
looked good.


Be sure to test with r17112 or later; it contains a double free() fix
that is necessary.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [PATCH] openib btl: extensible cpc selection enablement

2008-01-10 Thread Jeff Squyres

On Jan 10, 2008, at 11:55 AM, Jon Mason wrote:


BTW, I should point out that the modex CPC string list stuff is
currently somewhat wasteful in the presence of multiple ports on a
host.  This will definitely be bad at scale.  Specifically, we'll send
around a CPC string in the openib modex for *each* port.  This may be
repetitive (and wasteful at scale), especially if you have more than
one port/NIC of the same type in each host.  This can cause the modex
size to increase quite a bit.


While the message sent via modex is now longer, the number of messages
sent is the same.  So I would argue that this is only slightly less
optimal than the current implementation.


Not at scale.

Consider if someone has 2,000 8-core servers, each with a 2-port HCA.
Let's assume a full-machine run of 16,000 MPI processes, each of which
can use 2 ports.  Let's assume non-ConnectX HCAs to be conservative, so
they'll all be able to use the oob CPC (someday soon, RDMA CM and IBCM
will also be available, but let's start small).


Each of the 16k MPI procs will have "oob"+sizeof(uint32_t) twice in  
their modex for a grand total of 14 extra bytes.  No big deal on an  
individual message, but consider that that's 16,000 * 14 = 224,000  
extra bytes being gathered to mpirun.


Then consider that the whole pile of modex data is glommed together
and broadcast to each MPI process.  Hence, an extra
16,000 * 14 * 16,000 = 3,584,000,000 bytes are sent across the network
during MPI_INIT (in addition to whatever is already being sent in the
modex).


Ralph's work on the new ORTE branch will help this quite a bit (with  
the routed oob stuff -- sending modex messages only once to each node,  
vs. once to each process), but still, the numbers are large:


- gather phase: 16,000 * 14 = 224,000 extra bytes
- scatter phase: 16,000 * 14 * 2,000 = 448,000,000 extra bytes

This is much more manageable, but still -- we should be careful where
we can.


Switching to hashed names and index lists will save quite a bit.  For
example, if we do a dumb hash of the cpc name down to 1 byte and send
index lists of which ports use each cpc (each index can be 1 byte --
leading to a max of 256 ports in each host, which is probably
sufficient for the foreseeable future!), we're down to 3 extra bytes in
the modex, which is much more manageable:


in today's non-routed OOB:
- gather phase: 16,000 * 3 = 48,000 extra bytes
- scatter phase: 16,000 * 3 * 16,000 = 768,000,000 extra bytes

in the soon-to-be per-host modex distribution:
- gather phase: 16,000 * 3 = 48,000 extra bytes
- scatter phase: 16,000 * 3 * 2,000 = 96,000,000 extra bytes

Additionally, the routed oob makes the reality even better than that,  
because it uses a tree distribution for the modex.  So although the  
raw number of bytes is the same as a per-host-but-not-routed modex  
distribution, the distribution is quite wide, potentially avoiding  
network congestion (because different ports/links/servers are  
involved, all in parallel).


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [PATCH] openib btl: extensible cpc selection enablement

2008-01-10 Thread Jon Mason
On Thu, Jan 10, 2008 at 09:58:35AM -0500, Jeff Squyres wrote:
> BTW, I should point out that the modex CPC string list stuff is  
> currently somewhat wasteful in the presence of multiple ports on a  
> host.  This will definitely be bad at scale.  Specifically, we'll send  
> around a CPC string in the openib modex for *each* port.  This may be  
> repetitive (and wasteful at scale), especially if you have more than  
> one port/NIC of the same type in each host.  This can cause the modex  
> size to increase quite a bit.
> 

While the message sent via modex is now longer, the number of messages
sent is the same.  So I would argue that this is only slightly less
optimal than the current implementation.

> We wanted to get this new scheme *working* and then optimize the modex  
> message string usage a bit.  Options for optimization include:
>
> 1. list the cpc names only once in the modex message, each followed by  
> an indexed list of which ports use them
> 
> 2. use a simple hashing function to eliminate the string names in the  
> modex altogether -- and for extra bonus points, combine with #1 to  
> only list the cpc's once
> 
> 3. ...?

4. Profit!

> 
> I'd say that this optimization is pretty important for v1.3 (but it  
> shouldn't be hard to do).
> 
> 
> On Jan 9, 2008, at 6:37 PM, Jon Mason wrote:
> 
> > The new cpc selection framework is now in place.  The patch below
> > allows for dynamic selection of cpc methods based on what is
> > available.  It also allows for inclusion/exclusion of methods.  It
> > even further allows for modifying the priorities of certain cpc
> > methods to better determine the optimal cpc method.
> >
> > This patch also contains XRC compile time disablement (per Jeff's
> > patch).
> >
> > At a high level, the cpc selection works by walking through each cpc
> > and allowing it to test whether it is permissible to run on this
> > mpirun.  It returns a priority if it is permissible or a -1 if not.
> > All of the cpc names and priorities are rolled into a string.  This
> > string is then encapsulated in a message and passed around all the
> > ompi processes.  Once received and unpacked, the received list is
> > compared to a local copy of the list.  The connection method is
> > chosen by comparing the lists passed around to all nodes via modex
> > with the list generated locally.  Any non-negative number is a
> > potentially valid connection method.  The method below determines
> > the optimal connection method by taking the intersection of the two
> > lists.  The highest single value (with the other side also being
> > non-negative) is selected as the cpc method.
> >
> > Please test it out.  The tree can be found at
> > https://svn.open-mpi.org/svn/ompi/tmp-public/openib-cpc/
> >
> > This patch has been tested with IB and iWARP adapters on a 2 node  
> > system
> > (with it correctly choosing to use oob and happily ignoring iWARP
> > adapters).  It needs XRC testing and testing of larger node systems.
> >
> > Many thanks to Jeff for all of his help.
> >
> > Thanks,
> > Jon
> >
> > Index: ompi/mca/btl/openib/btl_openib_component.c
> > ===
> > --- ompi/mca/btl/openib/btl_openib_component.c  (revision 17101)
> > +++ ompi/mca/btl/openib/btl_openib_component.c  (working copy)
> > @@ -155,30 +155,70 @@
> >   */
> >  static int btl_openib_modex_send(void)
> >  {
> > -    int         rc, i;
> > -    size_t      size;
> > -    mca_btl_openib_port_info_t *ports = NULL;
> > +    int         rc, i;
> > +    char *message, *offset;
> > +    uint32_t size, size_save;
> > +    size_t msg_size;
> >
> > -    size = mca_btl_openib_component.ib_num_btls * sizeof (mca_btl_openib_port_info_t);
> > -    if (size != 0) {
> > -        ports = (mca_btl_openib_port_info_t *)malloc (size);
> > -        if (NULL == ports) {
> > -            BTL_ERROR(("Failed malloc: %s:%d\n", __FILE__, __LINE__));
> > -            return OMPI_ERR_OUT_OF_RESOURCE;
> > -        }
> > +    /* The message is packed into 2 parts:
> > +     * 1. a uint32_t indicating the number of ports in the message
> > +     * 2. for each port:
> > +     *    a. the port data
> > +     *    b. a uint32_t indicating a string length
> > +     *    c. the string cpc list for that port, length specified by 2b.
> > +     */
> > +    msg_size = sizeof(uint32_t) + mca_btl_openib_component.ib_num_btls * (sizeof(uint32_t) + sizeof(mca_btl_openib_port_info_t));
> > +    for (i = 0; i < mca_btl_openib_component.ib_num_btls; i++) {
> > +        msg_size += strlen(mca_btl_openib_component.openib_btls[i]->port_info.cpclist);
> > +    }
> >
> > -    for (i = 0; i < mca_btl_openib_component.ib_num_btls; i++) {
> > -        mca_btl_openib_module_t *btl = mca_btl_openib_component.openib_btls[i];
> > -        ports[i] = btl->port_info;
> > +    if (0 == msg_size) {
> > +        return 0;
> > +    }

Re: [OMPI devel] [PATCH] openib btl: extensible cpc selection enablement

2008-01-10 Thread Jon Mason
On Thu, Jan 10, 2008 at 11:17:48AM +0200, Pavel Shamis (Pasha) wrote:
> Jon Mason wrote:
> > The new cpc selection framework is now in place.  The patch below allows
> > for dynamic selection of cpc methods based on what is available.  It
> > also allows for inclusion/exclusion of methods.  It even further allows
> > for modifying the priorities of certain cpc methods to better determine
> > the optimal cpc method.
> >
> > This patch also contains XRC compile time disablement (per Jeff's
> > patch).
> >   
> We need to make sure that the CM stuff is disabled at compile time if it
> is not installed on the machine.

Oh, xoob doesn't show up at all for me.  So I'd say it is working. :-)

> > At a high level, the cpc selection works by walking through each cpc
> > and allowing it to test whether it is permissible to run on this
> > mpirun.  It returns a priority if it is permissible or a -1 if not.  All
> > of the cpc names and priorities are rolled into a string.  This string
> > is then encapsulated in a message and passed around all the ompi
> > processes.  Once received and unpacked, the received list is compared
> > to a local copy of the list.  The connection method is chosen by
> > comparing the lists passed around to all nodes via modex with the list
> > generated locally.  Any non-negative number is a potentially valid
> > connection method.  The method below determines the optimal
> > connection method by taking the intersection of the two lists.  The
> > highest single value (with the other side also being non-negative) is
> > selected as the cpc method.
> >
> > Please test it out.  The tree can be found at
> > https://svn.open-mpi.org/svn/ompi/tmp-public/openib-cpc/
> >
> > This patch has been tested with IB and iWARP adapters on a 2 node system
> > (with it correctly choosing to use oob and happily ignoring iWARP
> > adapters).  It needs XRC testing and testing of larger node systems.
> >   
> Did you run MTT over all these changes?

No, I did not.  My thinking was that since this is just setting up the
connections, it'll either work or not.  Since MTT is more functional
testing, it wouldn't really stress it.  But that could be false reasoning
on my part.  My main test was running the full IMB suite on it.

Of course, as soon as I get rdma cm in a working state, I'll be running
MTT to stress that.

> I have a few machines with ConnectX and I will try to run MTT on Sunday.

Awesome!  I appreciate it.

Thanks,
Jon

> -- 
> Pavel Shamis (Pasha)
> Mellanox Technologies
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Re: [OMPI devel] [PATCH] openib btl: extensible cpc selection enablement

2008-01-10 Thread Jeff Squyres
BTW, I should point out that the modex CPC string list stuff is  
currently somewhat wasteful in the presence of multiple ports on a  
host.  This will definitely be bad at scale.  Specifically, we'll send  
around a CPC string in the openib modex for *each* port.  This may be  
repetitive (and wasteful at scale), especially if you have more than  
one port/NIC of the same type in each host.  This can cause the modex  
size to increase quite a bit.


We wanted to get this new scheme *working* and then optimize the modex  
message string usage a bit.  Options for optimization include:


1. list the cpc names only once in the modex message, each followed by  
an indexed list of which ports use them


2. use a simple hashing function to eliminate the string names in the  
modex altogether -- and for extra bonus points, combine with #1 to  
only list the cpc's once


3. ...?

I'd say that this optimization is pretty important for v1.3 (but it  
shouldn't be hard to do).



On Jan 9, 2008, at 6:37 PM, Jon Mason wrote:

The new cpc selection framework is now in place.  The patch below allows
for dynamic selection of cpc methods based on what is available.  It
also allows for inclusion/exclusion of methods.  It even further allows
for modifying the priorities of certain cpc methods to better determine
the optimal cpc method.

This patch also contains XRC compile time disablement (per Jeff's
patch).

At a high level, the cpc selection works by walking through each cpc
and allowing it to test whether it is permissible to run on this
mpirun.  It returns a priority if it is permissible or a -1 if not.  All
of the cpc names and priorities are rolled into a string.  This string
is then encapsulated in a message and passed around all the ompi
processes.  Once received and unpacked, the received list is compared
to a local copy of the list.  The connection method is chosen by
comparing the lists passed around to all nodes via modex with the list
generated locally.  Any non-negative number is a potentially valid
connection method.  The method below determines the optimal connection
method by taking the intersection of the two lists.  The highest single
value (with the other side also being non-negative) is selected
as the cpc method.

Please test it out.  The tree can be found at
https://svn.open-mpi.org/svn/ompi/tmp-public/openib-cpc/

This patch has been tested with IB and iWARP adapters on a 2 node system
(with it correctly choosing to use oob and happily ignoring iWARP
adapters).  It needs XRC testing and testing of larger node systems.

Many thanks to Jeff for all of his help.

Thanks,
Jon

Index: ompi/mca/btl/openib/btl_openib_component.c
===
--- ompi/mca/btl/openib/btl_openib_component.c  (revision 17101)
+++ ompi/mca/btl/openib/btl_openib_component.c  (working copy)
@@ -155,30 +155,70 @@
  */
 static int btl_openib_modex_send(void)
 {
-    int         rc, i;
-    size_t      size;
-    mca_btl_openib_port_info_t *ports = NULL;
+    int         rc, i;
+    char *message, *offset;
+    uint32_t size, size_save;
+    size_t msg_size;
 
-    size = mca_btl_openib_component.ib_num_btls * sizeof (mca_btl_openib_port_info_t);
-    if (size != 0) {
-        ports = (mca_btl_openib_port_info_t *)malloc (size);
-        if (NULL == ports) {
-            BTL_ERROR(("Failed malloc: %s:%d\n", __FILE__, __LINE__));
-            return OMPI_ERR_OUT_OF_RESOURCE;
-        }
+    /* The message is packed into 2 parts:
+     * 1. a uint32_t indicating the number of ports in the message
+     * 2. for each port:
+     *    a. the port data
+     *    b. a uint32_t indicating a string length
+     *    c. the string cpc list for that port, length specified by 2b.
+     */
+    msg_size = sizeof(uint32_t) + mca_btl_openib_component.ib_num_btls * (sizeof(uint32_t) + sizeof(mca_btl_openib_port_info_t));
+    for (i = 0; i < mca_btl_openib_component.ib_num_btls; i++) {
+        msg_size += strlen(mca_btl_openib_component.openib_btls[i]->port_info.cpclist);
+    }
 
-    for (i = 0; i < mca_btl_openib_component.ib_num_btls; i++) {
-        mca_btl_openib_module_t *btl = mca_btl_openib_component.openib_btls[i];
-        ports[i] = btl->port_info;
+    if (0 == msg_size) {
+        return 0;
+    }
+
+    message = malloc(msg_size);
+    if (NULL == message) {
+        BTL_ERROR(("Failed malloc: %s:%d\n", __FILE__, __LINE__));
+        return OMPI_ERR_OUT_OF_RESOURCE;
+    }
+
+    /* Pack the number of ports */
+    size = mca_btl_openib_component.ib_num_btls;
 #if !defined(WORDS_BIGENDIAN) && OMPI_ENABLE_HETEROGENEOUS_SUPPORT
-        MCA_BTL_OPENIB_PORT_INFO_HTON(ports[i]);
+    size = htonl(size);
 #endif
-    }
+    memcpy(message, &size, sizeof(size));
+    offset = message + sizeof(size);
+
+    /* Pack each of the ports */
+    for (i = 0; i < mca_btl_openib_component.ib_num_btls; i++) {
+        /* Pack the port struct */
Re: [OMPI devel] [PATCH] openib btl: extensible cpc selection enablement

2008-01-10 Thread Jeff Squyres

On Jan 10, 2008, at 4:17 AM, Pavel Shamis (Pasha) wrote:


This patch also contains XRC compile time disablement (per Jeff's
patch).

We need to make sure that the CM stuff is disabled at compile time if
it is not installed on the machine.


The RDMA CM and IBCM stuff has not yet been implemented (was waiting  
for this functionality first).  But configure-time checks for those  
functions will, of course, be included.  The CPC's for RDMA CM and  
IBCM will not be compiled if the support libraries are not available,  
and therefore they won't be available for selection at run-time.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [PATCH] openib btl: extensible cpc selection enablement

2008-01-10 Thread Pavel Shamis (Pasha)

Jon Mason wrote:

The new cpc selection framework is now in place.  The patch below allows
for dynamic selection of cpc methods based on what is available.  It
also allows for inclusion/exclusion of methods.  It even further allows
for modifying the priorities of certain cpc methods to better determine
the optimal cpc method.

This patch also contains XRC compile time disablement (per Jeff's
patch).
  
We need to make sure that the CM stuff is disabled at compile time if it
is not installed on the machine.

At a high level, the cpc selection works by walking through each cpc
and allowing it to test whether it is permissible to run on this
mpirun.  It returns a priority if it is permissible or a -1 if not.  All
of the cpc names and priorities are rolled into a string.  This string
is then encapsulated in a message and passed around all the ompi
processes.  Once received and unpacked, the received list is compared
to a local copy of the list.  The connection method is chosen by
comparing the lists passed around to all nodes via modex with the list
generated locally.  Any non-negative number is a potentially valid
connection method.  The method below determines the optimal
connection method by taking the intersection of the two lists.  The
highest single value (with the other side also being non-negative) is
selected as the cpc method.

Please test it out.  The tree can be found at
https://svn.open-mpi.org/svn/ompi/tmp-public/openib-cpc/

This patch has been tested with IB and iWARP adapters on a 2 node system
(with it correctly choosing to use oob and happily ignoring iWARP
adapters).  It needs XRC testing and testing of larger node systems.
  

Did you run MTT over all these changes?
I have a few machines with ConnectX and I will try to run MTT on Sunday.


--
Pavel Shamis (Pasha)
Mellanox Technologies



[OMPI devel] [PATCH] openib btl: extensible cpc selection enablement

2008-01-09 Thread Jon Mason
The new cpc selection framework is now in place.  The patch below allows
for dynamic selection of cpc methods based on what is available.  It
also allows for inclusion/exclusion of methods.  It even further allows
for modifying the priorities of certain cpc methods to better determine
the optimal cpc method.

This patch also contains XRC compile time disablement (per Jeff's
patch).

At a high level, the cpc selection works by walking through each cpc
and allowing it to test whether it is permissible to run on this
mpirun.  It returns a priority if it is permissible or a -1 if not.  All
of the cpc names and priorities are rolled into a string.  This string
is then encapsulated in a message and passed around all the ompi
processes.  Once received and unpacked, the received list is compared
to a local copy of the list.  The connection method is chosen by
comparing the lists passed around to all nodes via modex with the list
generated locally.  Any non-negative number is a potentially valid
connection method.  The method below determines the optimal connection
method by taking the intersection of the two lists.  The highest single
value (with the other side also being non-negative) is selected
as the cpc method.

Please test it out.  The tree can be found at
https://svn.open-mpi.org/svn/ompi/tmp-public/openib-cpc/

This patch has been tested with IB and iWARP adapters on a 2 node system
(with it correctly choosing to use oob and happily ignoring iWARP
adapters).  It needs XRC testing and testing of larger node systems.

Many thanks to Jeff for all of his help.

Thanks,
Jon

Index: ompi/mca/btl/openib/btl_openib_component.c
===
--- ompi/mca/btl/openib/btl_openib_component.c  (revision 17101)
+++ ompi/mca/btl/openib/btl_openib_component.c  (working copy)
@@ -155,30 +155,70 @@
  */
 static int btl_openib_modex_send(void)
 {
-    int         rc, i;
-    size_t      size;
-    mca_btl_openib_port_info_t *ports = NULL;
+    int         rc, i;
+    char *message, *offset;
+    uint32_t size, size_save;
+    size_t msg_size;
 
-    size = mca_btl_openib_component.ib_num_btls * sizeof (mca_btl_openib_port_info_t);
-    if (size != 0) {
-        ports = (mca_btl_openib_port_info_t *)malloc (size);
-        if (NULL == ports) {
-            BTL_ERROR(("Failed malloc: %s:%d\n", __FILE__, __LINE__));
-            return OMPI_ERR_OUT_OF_RESOURCE;
-        }
+    /* The message is packed into 2 parts:
+     * 1. a uint32_t indicating the number of ports in the message
+     * 2. for each port:
+     *    a. the port data
+     *    b. a uint32_t indicating a string length
+     *    c. the string cpc list for that port, length specified by 2b.
+     */
+    msg_size = sizeof(uint32_t) + mca_btl_openib_component.ib_num_btls * (sizeof(uint32_t) + sizeof(mca_btl_openib_port_info_t));
+    for (i = 0; i < mca_btl_openib_component.ib_num_btls; i++) {
+        msg_size += strlen(mca_btl_openib_component.openib_btls[i]->port_info.cpclist);
+    }
 
-    for (i = 0; i < mca_btl_openib_component.ib_num_btls; i++) {
-        mca_btl_openib_module_t *btl = mca_btl_openib_component.openib_btls[i];
-        ports[i] = btl->port_info;
+    if (0 == msg_size) {
+        return 0;
+    }
+
+    message = malloc(msg_size);
+    if (NULL == message) {
+        BTL_ERROR(("Failed malloc: %s:%d\n", __FILE__, __LINE__));
+        return OMPI_ERR_OUT_OF_RESOURCE;
+    }
+
+    /* Pack the number of ports */
+    size = mca_btl_openib_component.ib_num_btls;
 #if !defined(WORDS_BIGENDIAN) && OMPI_ENABLE_HETEROGENEOUS_SUPPORT
-        MCA_BTL_OPENIB_PORT_INFO_HTON(ports[i]);
+    size = htonl(size);
 #endif
-    }
+    memcpy(message, &size, sizeof(size));
+    offset = message + sizeof(size);
+
+    /* Pack each of the ports */
+    for (i = 0; i < mca_btl_openib_component.ib_num_btls; i++) {
+        /* Pack the port struct */
+        memcpy(offset, &mca_btl_openib_component.openib_btls[i]->port_info, sizeof(mca_btl_openib_port_info_t));
+#if !defined(WORDS_BIGENDIAN) && OMPI_ENABLE_HETEROGENEOUS_SUPPORT
+        MCA_BTL_OPENIB_PORT_INFO_HTON(*(mca_btl_openib_port_info_t *)offset);
+#endif
+        offset += sizeof(mca_btl_openib_port_info_t);
+
+        /* Pack the strlen of the cpclist */
+        size = size_save = strlen(mca_btl_openib_component.openib_btls[i]->port_info.cpclist);
+#if !defined(WORDS_BIGENDIAN) && OMPI_ENABLE_HETEROGENEOUS_SUPPORT
+        size = htonl(size);
+#endif
+        memcpy(offset, &size, sizeof(size));
+        offset += sizeof(size);
+
+        /* Pack the string */
+        memcpy(offset, mca_btl_openib_component.openib_btls[i]->port_info.cpclist, size_save);
+        offset += size_save;
     }
-    rc = ompi_modex_send (&mca_btl_openib_component.super.btl_version, ports, size);
-    if (NULL != ports) {
-        free (ports);
-    }
+
+rc = ompi_modex_send(&mca_btl_openib_comp