Re: [OMPI devel] [PATCH] openib btl: extensable cpc selection enablement
Jon Mason wrote:
> > I have a few machines with ConnectX and I will try to run MTT on Sunday.
>
> Awesome! I appreciate it.

After fixing the compilation problem in the XRC part of the code, I was able to run MTT. Most of the tests pass; one test failed: mpi2c++_dynamics_test. The test passes without XRC. But I also see the test fail in the trunk. The last revision where it works is 1.3a1r17085. Strange.

Pasha
Re: [OMPI devel] [PATCH] openib btl: extensable cpc selection enablement
On Jan 10, 2008, at 4:17 AM, Pavel Shamis (Pasha) wrote:
> > This patch has been tested with IB and iWARP adapters on a 2 node system (with it correctly choosing to use oob and happily ignoring iWARP adapters). It needs XRC testing and testing of larger node systems.
>
> Did you run MTT over all these changes? I have a few machines with ConnectX and I will try to run MTT on Sunday.

I just completed a rather large MTT run over all the same tests and variants that I run MTT on for nightly tarballs every night. All looked good.

Be sure to test after r17112; there was a double free() fix at r17112 that is necessary.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] [PATCH] openib btl: extensable cpc selection enablement
On Jan 10, 2008, at 11:55 AM, Jon Mason wrote:
> > BTW, I should point out that the modex CPC string list stuff is currently somewhat wasteful in the presence of multiple ports on a host. This will definitely be bad at scale. Specifically, we'll send around a CPC string in the openib modex for *each* port. This may be repetitive (and wasteful at scale), especially if you have more than one port/NIC of the same type in each host. This can cause the modex size to increase quite a bit.
>
> While the message sent via modex is now longer, the number of messages sent is the same. So I would argue that this is only slightly less optimal than the current implementation.

Not at scale. Consider if someone has 2,000 8-core servers, each with a 2-port HCA. Let's assume a full-machine run of 16,000 MPI processes, each of which can use 2 ports. Let's assume non-ConnectX HCAs to be conservative, so they'll all be able to use the oob CPC (someday soon, RDMA CM and IBCM will also be available, but let's start small).

Each of the 16k MPI procs will have "oob"+sizeof(uint32_t) twice in their modex for a grand total of 14 extra bytes. No big deal on an individual message, but consider that that's 16,000 * 14 = 224,000 extra bytes being gathered to mpirun. Then consider that the whole pile of modex data is glommed together and broadcast to each MPI process. Hence, we're now sending an extra 16,000 * 14 * 16,000 = 3,584,000,000 bytes across the network during MPI_INIT (in addition to whatever is already being sent in the modex).

Ralph's work on the new ORTE branch will help this quite a bit (with the routed oob stuff -- sending modex messages only once to each node, vs. once to each process), but still, the numbers are large:

- gather phase: 16,000 * 14 = 224,000 extra bytes
- scatter phase: 16,000 * 14 * 2,000 = 448,000,000 extra bytes

This is much more manageable, but still -- we should be careful when we can. Switching to hashed names and index lists will save quite a bit.
For example, if we do a dumb hash of the cpc name down to 1 byte and send index lists of which ports use each cpc (each index can be 1 byte -- leading to a max of 256 ports in each host, which is probably sufficient for the foreseeable future!), we're down to 3 extra bytes in the modex, which is much more manageable.

In today's non-routed OOB:

- gather phase: 16,000 * 3 = 48,000 extra bytes
- scatter phase: 16,000 * 3 * 16,000 = 768,000,000 extra bytes

In the soon-to-be per-host modex distribution:

- gather phase: 16,000 * 3 = 48,000 extra bytes
- scatter phase: 16,000 * 3 * 2,000 = 96,000,000 extra bytes

Additionally, the routed oob makes the reality even better than that, because it uses a tree distribution for the modex. So although the raw number of bytes is the same as a per-host-but-not-routed modex distribution, the distribution is quite wide, potentially avoiding network congestion (because different ports/links/servers are involved, all in parallel).

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] [PATCH] openib btl: extensable cpc selection enablement
On Thu, Jan 10, 2008 at 09:58:35AM -0500, Jeff Squyres wrote:
> BTW, I should point out that the modex CPC string list stuff is currently somewhat wasteful in the presence of multiple ports on a host. This will definitely be bad at scale. Specifically, we'll send around a CPC string in the openib modex for *each* port. This may be repetitive (and wasteful at scale), especially if you have more than one port/NIC of the same type in each host. This can cause the modex size to increase quite a bit.

While the message sent via modex is now longer, the number of messages sent is the same. So I would argue that this is only slightly less optimal than the current implementation.

> We wanted to get this new scheme *working* and then optimize the modex message string usage a bit. Options for optimization include:
>
> 1. list the cpc names only once in the modex message, each followed by an indexed list of which ports use them
>
> 2. use a simple hashing function to eliminate the string names in the modex altogether -- and for extra bonus points, combine with #1 to only list the cpc's once
>
> 3. ...?

4. Profit!

> I'd say that this optimization is pretty important for v1.3 (but it shouldn't be hard to do).
>
> On Jan 9, 2008, at 6:37 PM, Jon Mason wrote:
>
> > The new cpc selection framework is now in place. The patch below allows for dynamic selection of cpc methods based on what is available. It also allows for inclusion/exclusion of methods. It further allows for modifying the priorities of certain cpc methods to better determine the optimal cpc method.
> >
> > This patch also contains XRC compile time disablement (per Jeff's patch).
> >
> > At a high level, the cpc selection works by walking through each cpc and allowing it to test whether it is permissible to run on this mpirun. It returns a priority if it is permissible, or -1 if not.
> > All of the cpc names and priorities are rolled into a string. This string is then encapsulated in a message and passed around all the ompi processes. Once received and unpacked, the list received is compared to a local copy of the list. The connection method is chosen by comparing the lists passed around to all nodes via modex with the list generated locally. Any non-negative number is a potentially valid connection method. The method below of determining the optimal connection method is to take the cross-section of the two lists. The highest single value (and the other side being non-negative) is selected as the cpc method.
> >
> > Please test it out. The tree can be found at https://svn.open-mpi.org/svn/ompi/tmp-public/openib-cpc/
> >
> > This patch has been tested with IB and iWARP adapters on a 2 node system (with it correctly choosing to use oob and happily ignoring iWARP adapters). It needs XRC testing and testing of larger node systems.
> >
> > Many thanks to Jeff for all of his help.
> >
> > Thanks,
> > Jon
> >
> > Index: ompi/mca/btl/openib/btl_openib_component.c
> > ===================================================================
> > --- ompi/mca/btl/openib/btl_openib_component.c (revision 17101)
> > +++ ompi/mca/btl/openib/btl_openib_component.c (working copy)
> > @@ -155,30 +155,70 @@
> >   */
> >  static int btl_openib_modex_send(void)
> >  {
> > -    int rc, i;
> > -    size_t size;
> > -    mca_btl_openib_port_info_t *ports = NULL;
> > +    int rc, i;
> > +    char *message, *offset;
> > +    uint32_t size, size_save;
> > +    size_t msg_size;
> >
> > -    size = mca_btl_openib_component.ib_num_btls * sizeof(mca_btl_openib_port_info_t);
> > -    if (size != 0) {
> > -        ports = (mca_btl_openib_port_info_t *)malloc(size);
> > -        if (NULL == ports) {
> > -            BTL_ERROR(("Failed malloc: %s:%d\n", __FILE__, __LINE__));
> > -            return OMPI_ERR_OUT_OF_RESOURCE;
> > -        }
> > +    /* The message is packed into 2 parts:
> > +     * 1. a uint32_t indicating the number of ports in the message
> > +     * 2. for each port:
> > +     *    a. the port data
> > +     *    b. a uint32_t indicating a string length
> > +     *    c. the string cpc list for that port, length specified by 2b.
> > +     */
> > +    msg_size = sizeof(uint32_t) + mca_btl_openib_component.ib_num_btls * (sizeof(uint32_t) + sizeof(mca_btl_openib_port_info_t));
> > +    for (i = 0; i < mca_btl_openib_component.ib_num_btls; i++) {
> > +        msg_size += strlen(mca_btl_openib_component.openib_btls[i]->port_info.cpclist);
> > +    }
> >
> > -    for (i = 0; i < mca_btl_openib_component.ib_num_btls; i++) {
> > -        mca_btl_openib_module_t *btl = mca_btl_openib_component.openib_btls[i];
> > -        ports[i] = btl->port_info;
> > +    if (0 == msg_size) {
> > +        return 0;
> > +    }
> >
Re: [OMPI devel] [PATCH] openib btl: extensable cpc selection enablement
On Thu, Jan 10, 2008 at 11:17:48AM +0200, Pavel Shamis (Pasha) wrote:
> Jon Mason wrote:
> > The new cpc selection framework is now in place. The patch below allows for dynamic selection of cpc methods based on what is available. It also allows for inclusion/exclusion of methods. It further allows for modifying the priorities of certain cpc methods to better determine the optimal cpc method.
> >
> > This patch also contains XRC compile time disablement (per Jeff's patch).
>
> Need to make sure that the CM stuff will be disabled at compile time if it is not installed on the machine.

Oh, xoob doesn't show up at all for me. So I'd say it is working. :-)

> > At a high level, the cpc selection works by walking through each cpc and allowing it to test whether it is permissible to run on this mpirun. It returns a priority if it is permissible, or -1 if not. All of the cpc names and priorities are rolled into a string. This string is then encapsulated in a message and passed around all the ompi processes. Once received and unpacked, the list received is compared to a local copy of the list. The connection method is chosen by comparing the lists passed around to all nodes via modex with the list generated locally. Any non-negative number is a potentially valid connection method. The method below of determining the optimal connection method is to take the cross-section of the two lists. The highest single value (and the other side being non-negative) is selected as the cpc method.
> >
> > Please test it out. The tree can be found at https://svn.open-mpi.org/svn/ompi/tmp-public/openib-cpc/
> >
> > This patch has been tested with IB and iWARP adapters on a 2 node system (with it correctly choosing to use oob and happily ignoring iWARP adapters). It needs XRC testing and testing of larger node systems.
>
> Did you run MTT over all these changes?

No, I did not. My thinking was that since this is just setting up the connections, it'll either work or not. Since MTT is more functional testing, it wouldn't really stress it. But that could be false reasoning on my part. My main test was running the full IMB suite on it. Of course, as soon as I get RDMA CM in a working state, I'll be running MTT to stress that.

> I have a few machines with ConnectX and I will try to run MTT on Sunday.

Awesome! I appreciate it.

Thanks,
Jon

> --
> Pavel Shamis (Pasha)
> Mellanox Technologies
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] [PATCH] openib btl: extensable cpc selection enablement
BTW, I should point out that the modex CPC string list stuff is currently somewhat wasteful in the presence of multiple ports on a host. This will definitely be bad at scale. Specifically, we'll send around a CPC string in the openib modex for *each* port. This may be repetitive (and wasteful at scale), especially if you have more than one port/NIC of the same type in each host. This can cause the modex size to increase quite a bit.

We wanted to get this new scheme *working* and then optimize the modex message string usage a bit. Options for optimization include:

1. list the cpc names only once in the modex message, each followed by an indexed list of which ports use them

2. use a simple hashing function to eliminate the string names in the modex altogether -- and for extra bonus points, combine with #1 to only list the cpc's once

3. ...?

I'd say that this optimization is pretty important for v1.3 (but it shouldn't be hard to do).

On Jan 9, 2008, at 6:37 PM, Jon Mason wrote:

The new cpc selection framework is now in place. The patch below allows for dynamic selection of cpc methods based on what is available. It also allows for inclusion/exclusion of methods. It further allows for modifying the priorities of certain cpc methods to better determine the optimal cpc method.

This patch also contains XRC compile time disablement (per Jeff's patch).

At a high level, the cpc selection works by walking through each cpc and allowing it to test whether it is permissible to run on this mpirun. It returns a priority if it is permissible, or -1 if not. All of the cpc names and priorities are rolled into a string. This string is then encapsulated in a message and passed around all the ompi processes. Once received and unpacked, the list received is compared to a local copy of the list. The connection method is chosen by comparing the lists passed around to all nodes via modex with the list generated locally.
Any non-negative number is a potentially valid connection method. The method below of determining the optimal connection method is to take the cross-section of the two lists. The highest single value (and the other side being non-negative) is selected as the cpc method.

Please test it out. The tree can be found at https://svn.open-mpi.org/svn/ompi/tmp-public/openib-cpc/

This patch has been tested with IB and iWARP adapters on a 2 node system (with it correctly choosing to use oob and happily ignoring iWARP adapters). It needs XRC testing and testing of larger node systems.

Many thanks to Jeff for all of his help.

Thanks,
Jon

Index: ompi/mca/btl/openib/btl_openib_component.c
===================================================================
--- ompi/mca/btl/openib/btl_openib_component.c (revision 17101)
+++ ompi/mca/btl/openib/btl_openib_component.c (working copy)
@@ -155,30 +155,70 @@
  */
 static int btl_openib_modex_send(void)
 {
-    int rc, i;
-    size_t size;
-    mca_btl_openib_port_info_t *ports = NULL;
+    int rc, i;
+    char *message, *offset;
+    uint32_t size, size_save;
+    size_t msg_size;

-    size = mca_btl_openib_component.ib_num_btls * sizeof(mca_btl_openib_port_info_t);
-    if (size != 0) {
-        ports = (mca_btl_openib_port_info_t *)malloc(size);
-        if (NULL == ports) {
-            BTL_ERROR(("Failed malloc: %s:%d\n", __FILE__, __LINE__));
-            return OMPI_ERR_OUT_OF_RESOURCE;
-        }
+    /* The message is packed into 2 parts:
+     * 1. a uint32_t indicating the number of ports in the message
+     * 2. for each port:
+     *    a. the port data
+     *    b. a uint32_t indicating a string length
+     *    c. the string cpc list for that port, length specified by 2b.
+     */
+    msg_size = sizeof(uint32_t) + mca_btl_openib_component.ib_num_btls * (sizeof(uint32_t) + sizeof(mca_btl_openib_port_info_t));
+    for (i = 0; i < mca_btl_openib_component.ib_num_btls; i++) {
+        msg_size += strlen(mca_btl_openib_component.openib_btls[i]->port_info.cpclist);
+    }

-    for (i = 0; i < mca_btl_openib_component.ib_num_btls; i++) {
-        mca_btl_openib_module_t *btl = mca_btl_openib_component.openib_btls[i];
-        ports[i] = btl->port_info;
+    if (0 == msg_size) {
+        return 0;
+    }
+
+    message = malloc(msg_size);
+    if (NULL == message) {
+        BTL_ERROR(("Failed malloc: %s:%d\n", __FILE__, __LINE__));
+        return OMPI_ERR_OUT_OF_RESOURCE;
+    }
+
+    /* Pack the number of ports */
+    size = mca_btl_openib_component.ib_num_btls;
 #if !defined(WORDS_BIGENDIAN) && OMPI_ENABLE_HETEROGENEOUS_SUPPORT
-        MCA_BTL_OPENIB_PORT_INFO_HTON(ports[i]);
+    size = htonl(size);
 #endif
-    }
+    memcpy(message, &size, sizeof(size));
+    offset = message + sizeof(size);
+
+    /* Pack each of the ports */
+    for (i = 0; i < mca_btl_openib_component.ib_num_btls; i++) {
+        /* Pack the port struct */
Re: [OMPI devel] [PATCH] openib btl: extensable cpc selection enablement
On Jan 10, 2008, at 4:17 AM, Pavel Shamis (Pasha) wrote:
> > This patch also contains XRC compile time disablement (per Jeff's patch).
>
> Need to make sure that the CM stuff will be disabled at compile time if it is not installed on the machine.

The RDMA CM and IBCM stuff has not yet been implemented (we were waiting for this functionality first). But configure-time checks for those functions will, of course, be included. The CPCs for RDMA CM and IBCM will not be compiled if the support libraries are not available, and therefore they won't be available for selection at run-time.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] [PATCH] openib btl: extensable cpc selection enablement
Jon Mason wrote:
> The new cpc selection framework is now in place. The patch below allows for dynamic selection of cpc methods based on what is available. It also allows for inclusion/exclusion of methods. It further allows for modifying the priorities of certain cpc methods to better determine the optimal cpc method.
>
> This patch also contains XRC compile time disablement (per Jeff's patch).

Need to make sure that the CM stuff will be disabled at compile time if it is not installed on the machine.

> At a high level, the cpc selection works by walking through each cpc and allowing it to test whether it is permissible to run on this mpirun. It returns a priority if it is permissible, or -1 if not. All of the cpc names and priorities are rolled into a string. This string is then encapsulated in a message and passed around all the ompi processes. Once received and unpacked, the list received is compared to a local copy of the list. The connection method is chosen by comparing the lists passed around to all nodes via modex with the list generated locally. Any non-negative number is a potentially valid connection method. The method below of determining the optimal connection method is to take the cross-section of the two lists. The highest single value (and the other side being non-negative) is selected as the cpc method.
>
> Please test it out. The tree can be found at https://svn.open-mpi.org/svn/ompi/tmp-public/openib-cpc/
>
> This patch has been tested with IB and iWARP adapters on a 2 node system (with it correctly choosing to use oob and happily ignoring iWARP adapters). It needs XRC testing and testing of larger node systems.

Did you run MTT over all these changes? I have a few machines with ConnectX and I will try to run MTT on Sunday.

--
Pavel Shamis (Pasha)
Mellanox Technologies
[OMPI devel] [PATCH] openib btl: extensable cpc selection enablement
The new cpc selection framework is now in place. The patch below allows for dynamic selection of cpc methods based on what is available. It also allows for inclusion/exclusion of methods. It further allows for modifying the priorities of certain cpc methods to better determine the optimal cpc method.

This patch also contains XRC compile time disablement (per Jeff's patch).

At a high level, the cpc selection works by walking through each cpc and allowing it to test whether it is permissible to run on this mpirun. It returns a priority if it is permissible, or -1 if not. All of the cpc names and priorities are rolled into a string. This string is then encapsulated in a message and passed around all the ompi processes. Once received and unpacked, the list received is compared to a local copy of the list. The connection method is chosen by comparing the lists passed around to all nodes via modex with the list generated locally. Any non-negative number is a potentially valid connection method. The method below of determining the optimal connection method is to take the cross-section of the two lists. The highest single value (and the other side being non-negative) is selected as the cpc method.

Please test it out. The tree can be found at https://svn.open-mpi.org/svn/ompi/tmp-public/openib-cpc/

This patch has been tested with IB and iWARP adapters on a 2 node system (with it correctly choosing to use oob and happily ignoring iWARP adapters). It needs XRC testing and testing of larger node systems.

Many thanks to Jeff for all of his help.
Thanks,
Jon

Index: ompi/mca/btl/openib/btl_openib_component.c
===================================================================
--- ompi/mca/btl/openib/btl_openib_component.c (revision 17101)
+++ ompi/mca/btl/openib/btl_openib_component.c (working copy)
@@ -155,30 +155,70 @@
  */
 static int btl_openib_modex_send(void)
 {
-    int rc, i;
-    size_t size;
-    mca_btl_openib_port_info_t *ports = NULL;
+    int rc, i;
+    char *message, *offset;
+    uint32_t size, size_save;
+    size_t msg_size;

-    size = mca_btl_openib_component.ib_num_btls * sizeof(mca_btl_openib_port_info_t);
-    if (size != 0) {
-        ports = (mca_btl_openib_port_info_t *)malloc(size);
-        if (NULL == ports) {
-            BTL_ERROR(("Failed malloc: %s:%d\n", __FILE__, __LINE__));
-            return OMPI_ERR_OUT_OF_RESOURCE;
-        }
+    /* The message is packed into 2 parts:
+     * 1. a uint32_t indicating the number of ports in the message
+     * 2. for each port:
+     *    a. the port data
+     *    b. a uint32_t indicating a string length
+     *    c. the string cpc list for that port, length specified by 2b.
+     */
+    msg_size = sizeof(uint32_t) + mca_btl_openib_component.ib_num_btls * (sizeof(uint32_t) + sizeof(mca_btl_openib_port_info_t));
+    for (i = 0; i < mca_btl_openib_component.ib_num_btls; i++) {
+        msg_size += strlen(mca_btl_openib_component.openib_btls[i]->port_info.cpclist);
+    }

-    for (i = 0; i < mca_btl_openib_component.ib_num_btls; i++) {
-        mca_btl_openib_module_t *btl = mca_btl_openib_component.openib_btls[i];
-        ports[i] = btl->port_info;
+    if (0 == msg_size) {
+        return 0;
+    }
+
+    message = malloc(msg_size);
+    if (NULL == message) {
+        BTL_ERROR(("Failed malloc: %s:%d\n", __FILE__, __LINE__));
+        return OMPI_ERR_OUT_OF_RESOURCE;
+    }
+
+    /* Pack the number of ports */
+    size = mca_btl_openib_component.ib_num_btls;
 #if !defined(WORDS_BIGENDIAN) && OMPI_ENABLE_HETEROGENEOUS_SUPPORT
-        MCA_BTL_OPENIB_PORT_INFO_HTON(ports[i]);
+    size = htonl(size);
 #endif
-    }
+    memcpy(message, &size, sizeof(size));
+    offset = message + sizeof(size);
+
+    /* Pack each of the ports */
+    for (i = 0; i < mca_btl_openib_component.ib_num_btls; i++) {
+        /* Pack the port struct */
+        memcpy(offset, &mca_btl_openib_component.openib_btls[i]->port_info, sizeof(mca_btl_openib_port_info_t));
+#if !defined(WORDS_BIGENDIAN) && OMPI_ENABLE_HETEROGENEOUS_SUPPORT
+        MCA_BTL_OPENIB_PORT_INFO_HTON(*(mca_btl_openib_port_info_t *)offset);
+#endif
+        offset += sizeof(mca_btl_openib_port_info_t);
+
+        /* Pack the strlen of the cpclist */
+        size = size_save = strlen(mca_btl_openib_component.openib_btls[i]->port_info.cpclist);
+#if !defined(WORDS_BIGENDIAN) && OMPI_ENABLE_HETEROGENEOUS_SUPPORT
+        size = htonl(size);
+#endif
+        memcpy(offset, &size, sizeof(size));
+        offset += sizeof(size);
+
+        /* Pack the string */
+        memcpy(offset, mca_btl_openib_component.openib_btls[i]->port_info.cpclist, size_save);
+        offset += size_save;
     }

-    rc = ompi_modex_send (&mca_btl_openib_component.super.btl_version, ports, size);
-    if (NULL != ports) {
-        free (ports);
-    }
+
+    rc = ompi_modex_send(&mca_btl_openib_comp