Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r24395
You would think, but I did not want to speculate about what OFED might do. I'm fine skipping the Solaris check; if OFED does include it, things may have to change at that point anyway.

On 02/16/11 09:41, Jeff Squyres wrote:
If OFED includes that constant, wouldn't we want to use it? PCI ordering is PCI ordering (i.e., unreliable) on all hardware -- or am I wrong?

On Feb 16, 2011, at 8:59 AM, Don Kerr wrote:
I considered that, but I wanted to guard against future OFED inclusion. Removing the Solaris check is easy enough.

On 02/16/11 08:49, Jeff Squyres wrote:
On Feb 16, 2011, at 8:29 AM, Don Kerr wrote:
Yes, this is Solaris only. OFED has not brought back the IBV_ACCESS_SO flag. Not sure they ever will.

It should be sufficient to AC_CHECK_DECLS then -- no need for the additional Solaris check.
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r24395
Yes, this is Solaris only. OFED has not brought back the IBV_ACCESS_SO flag. Not sure they ever will.

On 02/16/11 08:15, Jeff Squyres wrote:
Oracle -- Is this really only specific to Solaris? More comments below about configure.m4.

On Feb 16, 2011, at 12:37 AM, dk...@osl.iu.edu wrote:
Author: dkerr
Date: 2011-02-16 00:37:22 EST (Wed, 16 Feb 2011)
New Revision: 24395
URL: https://svn.open-mpi.org/trac/ompi/changeset/24395

Log:
on Solaris, when IBV_ACCESS_SO is available, use strong ordered memory region for eager rdma connection

Text files modified:
   trunk/ompi/mca/btl/openib/btl_openib_component.c | 13 ++---
   trunk/ompi/mca/btl/openib/btl_openib_endpoint.c  | 19 +--
   trunk/ompi/mca/btl/openib/configure.m4           | 16 +++-
   3 files changed, 42 insertions(+), 6 deletions(-)

Modified: trunk/ompi/mca/btl/openib/btl_openib_component.c
==
--- trunk/ompi/mca/btl/openib/btl_openib_component.c (original)
+++ trunk/ompi/mca/btl/openib/btl_openib_component.c 2011-02-16 00:37:22 EST (Wed, 16 Feb 2011)
@@ -15,7 +15,7 @@
  * Copyright (c) 2006-2007 Los Alamos National Security, LLC. All rights
  *                         reserved.
  * Copyright (c) 2006-2007 Voltaire All rights reserved.
- * Copyright (c) 2009-2010 Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2009-2011 Oracle and/or its affiliates. All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -527,9 +527,16 @@
 {
     mca_btl_openib_device_t *device = (mca_btl_openib_device_t*)reg_data;
     mca_btl_openib_reg_t *openib_reg = (mca_btl_openib_reg_t*)reg;
+    enum ibv_access_flags access_flag = IBV_ACCESS_LOCAL_WRITE |
+        IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ;

-    openib_reg->mr = ibv_reg_mr(device->ib_pd, base, size, IBV_ACCESS_LOCAL_WRITE |
-        IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ);
+#if defined(HAVE_IBV_ACCESS_SO)
+    if (reg->flags & MCA_MPOOL_FLAGS_SO_MEM) {
+        access_flag |= IBV_ACCESS_SO;
+    }
+#endif
+
+    openib_reg->mr = ibv_reg_mr(device->ib_pd, base, size, access_flag);

     if (NULL == openib_reg->mr) {
         return OMPI_ERR_OUT_OF_RESOURCE;

Modified: trunk/ompi/mca/btl/openib/btl_openib_endpoint.c
==
--- trunk/ompi/mca/btl/openib/btl_openib_endpoint.c (original)
+++ trunk/ompi/mca/btl/openib/btl_openib_endpoint.c 2011-02-16 00:37:22 EST (Wed, 16 Feb 2011)
@@ -16,7 +16,7 @@
  * Copyright (c) 2006-2007 Voltaire All rights reserved.
  * Copyright (c) 2006-2009 Mellanox Technologies, Inc. All rights reserved.
  * Copyright (c) 2010 IBM Corporation. All rights reserved.
- * Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved
+ * Copyright (c) 2010-2011 Oracle and/or its affiliates. All rights reserved
  *
  * $COPYRIGHT$
  *
@@ -911,6 +911,7 @@
     char *buf;
     mca_btl_openib_recv_frag_t *headers_buf;
     int i;
+    uint32_t flag = MCA_MPOOL_FLAGS_CACHE_BYPASS;

     /* Set local rdma pointer to 1 temporarily so other threads will not try
      * to enter the function */
@@ -925,11 +926,25 @@
     if(NULL == headers_buf)
         goto unlock_rdma_local;

+#if defined(HAVE_IBV_ACCESS_SO)
+    /* Solaris implements the Relaxed Ordering feature defined in the
+       PCI Specification. With this in mind any memory region which
+       relies on a buffer being written in a specific order, for
+       example the eager rdma connections created in this routine,
+       must set a strong order flag when registering the memory for
+       rdma operations.
+
+       The following flag will be interpreted and the appropriate
+       steps will be taken when the memory is registered in
+       openib_reg_mr(). */
+    flag |= MCA_MPOOL_FLAGS_SO_MEM;
+#endif
+
     buf = (char *) openib_btl->super.btl_mpool->mpool_alloc(openib_btl->super.btl_mpool,
             openib_btl->eager_rdma_frag_size * mca_btl_openib_component.eager_rdma_num,
             mca_btl_openib_component.buffer_alignment,
-            MCA_MPOOL_FLAGS_CACHE_BYPASS,
+            flag,
             (mca_mpool_base_registration_t**)&openib_btl->eager_rdma_local.reg);

     if(!buf)

Modified: trunk/ompi/mca/btl/openib/configure.m4
==
--- trunk/ompi/mca/btl/openib/configure.m4 (original)
+++ trunk/ompi/mca/btl/openib/configure.m4 2011-02-16 00:37:22 EST (Wed, 16 Feb 2011)
@@ -12,6 +12,7 @@
 # All rights reserved.
 # Copyright (c) 2007-2010 Cisco Systems, Inc. All rights reserved.
 # Copyright (c) 2008 Mellanox Technologies. All rights reserved.
+# Copyright (c) 2011 Oracle and/or its affiliates. All rights reserved.
 # $COPYRIGHT$
 #
 # Additional copyrights may
Re: [OMPI devel] trac #2034 : single rail openib btl shows better bandwidth than dual rail (12k< x < 128k)
On 10/08/09 17:14, Don Kerr wrote:
George, This is an interesting approach, although I am guessing the changes would be widespread and have many performance implications. Am I wrong in this belief?

My point here is that if this is going to have as many performance implications as I think it will, it probably makes sense to investigate the potential bigger dual-rail issue and consider the "never share" approach in the larger context.

-DON

On 10/08/09 11:45, George Bosilca wrote:
Don, I think we can do something slightly different that will satisfy everybody. How about a solution where each BTL will define a limit below which a message will never be shared with another BTL? We can have two such limits, one for the send protocol and one for RMA (it will apply either to PUT or GET operations based on the BTL support and PML decision).

george.

On Oct 8, 2009, at 11:01, Don Kerr wrote:
On 10/07/09 13:52, George Bosilca wrote:
Don, The problem is that a particular BTL doesn't have knowledge about the other selected BTLs, so allowing the BTLs to set this limit is not as easy as it sounds. However, in the case where two identical BTLs are selected and they are the only ones, this clearly is a better approach. If this parameter is set at the PML level, I can't imagine how we figure out the correct value depending on the BTLs. I see this as a pretty strong restriction. How do we know we set a value that makes sense?

OK, I now see why setting it at the btl level is difficult. And for the case of multiple btls which are also different component types, however unlikely that is, a pml setting will not be optimal for both.

-DON

george.

On Oct 7, 2009, at 10:19, Don Kerr wrote:
George, Were you suggesting that the proposed new parameter "max_rdma_single_rget" be set by the individual btls, similar to "btl_eager_limit"? Seems to me that is the better approach if I am to move forward with this.

-DON

On 10/06/09 11:14, Don Kerr wrote:
I agree there is probably a larger issue here, and yes, this is somewhat specific, but whereas OB1 appears to have multiple protocols depending on the capabilities of the BTLs, I would not characterize it as an IB-centric problem. Maybe an OB1 RDMA problem. There is a clear benefit from modifying this specific case. Do you think it's not worth making incremental improvements while also attacking a potential bigger issue?

-DON

On 10/06/09 10:52, George Bosilca wrote:
Don, This seems a very IB-centric problem (and solution) going up in the PML. Moreover, I noticed that independent of the BTL we have some problems with multi-rail performance. As an example, on a cluster with 3 GB cards we get the same performance if I enable 2 or 3. Didn't have time to look into the details, but this might be a more general problem.

george.

On Oct 6, 2009, at 09:51, Don Kerr wrote:
I intend to make the change suggested in this ticket to the trunk. The change does not impact the single-rail case (tested with the openib btl) and does improve the dual-rail case. Since it does involve performance and I am adding an OB1 mca parameter, I just wanted to check if anyone was interested or had an issue with it before I committed the change.

-DON

___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] trac #2034 : single rail openib btl shows better bandwidth than dual rail (12k< x < 128k)
George, This is an interesting approach, although I am guessing the changes would be widespread and have many performance implications. Am I wrong in this belief?

-DON

On 10/08/09 11:45, George Bosilca wrote:
Don, I think we can do something slightly different that will satisfy everybody. How about a solution where each BTL will define a limit below which a message will never be shared with another BTL? We can have two such limits, one for the send protocol and one for RMA (it will apply either to PUT or GET operations based on the BTL support and PML decision).

george.

On Oct 8, 2009, at 11:01, Don Kerr wrote:
On 10/07/09 13:52, George Bosilca wrote:
Don, The problem is that a particular BTL doesn't have knowledge about the other selected BTLs, so allowing the BTLs to set this limit is not as easy as it sounds. However, in the case where two identical BTLs are selected and they are the only ones, this clearly is a better approach. If this parameter is set at the PML level, I can't imagine how we figure out the correct value depending on the BTLs. I see this as a pretty strong restriction. How do we know we set a value that makes sense?

OK, I now see why setting it at the btl level is difficult. And for the case of multiple btls which are also different component types, however unlikely that is, a pml setting will not be optimal for both.

-DON

george.

On Oct 7, 2009, at 10:19, Don Kerr wrote:
George, Were you suggesting that the proposed new parameter "max_rdma_single_rget" be set by the individual btls, similar to "btl_eager_limit"? Seems to me that is the better approach if I am to move forward with this.

-DON

On 10/06/09 11:14, Don Kerr wrote:
I agree there is probably a larger issue here, and yes, this is somewhat specific, but whereas OB1 appears to have multiple protocols depending on the capabilities of the BTLs, I would not characterize it as an IB-centric problem. Maybe an OB1 RDMA problem. There is a clear benefit from modifying this specific case. Do you think it's not worth making incremental improvements while also attacking a potential bigger issue?

-DON

On 10/06/09 10:52, George Bosilca wrote:
Don, This seems a very IB-centric problem (and solution) going up in the PML. Moreover, I noticed that independent of the BTL we have some problems with multi-rail performance. As an example, on a cluster with 3 GB cards we get the same performance if I enable 2 or 3. Didn't have time to look into the details, but this might be a more general problem.

george.

On Oct 6, 2009, at 09:51, Don Kerr wrote:
I intend to make the change suggested in this ticket to the trunk. The change does not impact the single-rail case (tested with the openib btl) and does improve the dual-rail case. Since it does involve performance and I am adding an OB1 mca parameter, I just wanted to check if anyone was interested or had an issue with it before I committed the change.

-DON
Re: [OMPI devel] trac #2034 : single rail openib btl shows better bandwidth than dual rail (12k< x < 128k)
On 10/07/09 13:52, George Bosilca wrote:
Don, The problem is that a particular BTL doesn't have knowledge about the other selected BTLs, so allowing the BTLs to set this limit is not as easy as it sounds. However, in the case where two identical BTLs are selected and they are the only ones, this clearly is a better approach. If this parameter is set at the PML level, I can't imagine how we figure out the correct value depending on the BTLs. I see this as a pretty strong restriction. How do we know we set a value that makes sense?

OK, I now see why setting it at the btl level is difficult. And for the case of multiple btls which are also different component types, however unlikely that is, a pml setting will not be optimal for both.

-DON

george.

On Oct 7, 2009, at 10:19, Don Kerr wrote:
George, Were you suggesting that the proposed new parameter "max_rdma_single_rget" be set by the individual btls, similar to "btl_eager_limit"? Seems to me that is the better approach if I am to move forward with this.

-DON

On 10/06/09 11:14, Don Kerr wrote:
I agree there is probably a larger issue here, and yes, this is somewhat specific, but whereas OB1 appears to have multiple protocols depending on the capabilities of the BTLs, I would not characterize it as an IB-centric problem. Maybe an OB1 RDMA problem. There is a clear benefit from modifying this specific case. Do you think it's not worth making incremental improvements while also attacking a potential bigger issue?

-DON

On 10/06/09 10:52, George Bosilca wrote:
Don, This seems a very IB-centric problem (and solution) going up in the PML. Moreover, I noticed that independent of the BTL we have some problems with multi-rail performance. As an example, on a cluster with 3 GB cards we get the same performance if I enable 2 or 3. Didn't have time to look into the details, but this might be a more general problem.

george.

On Oct 6, 2009, at 09:51, Don Kerr wrote:
I intend to make the change suggested in this ticket to the trunk. The change does not impact the single-rail case (tested with the openib btl) and does improve the dual-rail case. Since it does involve performance and I am adding an OB1 mca parameter, I just wanted to check if anyone was interested or had an issue with it before I committed the change.

-DON
Re: [OMPI devel] trac #2034 : single rail openib btl shows better bandwidth than dual rail (12k< x < 128k)
George, Were you suggesting that the proposed new parameter "max_rdma_single_rget" be set by the individual btls, similar to "btl_eager_limit"? Seems to me that is the better approach if I am to move forward with this.

-DON

On 10/06/09 11:14, Don Kerr wrote:
I agree there is probably a larger issue here, and yes, this is somewhat specific, but whereas OB1 appears to have multiple protocols depending on the capabilities of the BTLs, I would not characterize it as an IB-centric problem. Maybe an OB1 RDMA problem. There is a clear benefit from modifying this specific case. Do you think it's not worth making incremental improvements while also attacking a potential bigger issue?

-DON

On 10/06/09 10:52, George Bosilca wrote:
Don, This seems a very IB-centric problem (and solution) going up in the PML. Moreover, I noticed that independent of the BTL we have some problems with multi-rail performance. As an example, on a cluster with 3 GB cards we get the same performance if I enable 2 or 3. Didn't have time to look into the details, but this might be a more general problem.

george.

On Oct 6, 2009, at 09:51, Don Kerr wrote:
I intend to make the change suggested in this ticket to the trunk. The change does not impact the single-rail case (tested with the openib btl) and does improve the dual-rail case. Since it does involve performance and I am adding an OB1 mca parameter, I just wanted to check if anyone was interested or had an issue with it before I committed the change.

-DON
[OMPI devel] trac #2034 : single rail openib btl shows better bandwidth than dual rail (12k< x < 128k)
I intend to make the change suggested in this ticket to the trunk. The change does not impact the single-rail case (tested with the openib btl) and does improve the dual-rail case. Since it does involve performance and I am adding an OB1 mca parameter, I just wanted to check if anyone was interested or had an issue with it before I committed the change.

-DON
Re: [OMPI devel] BTL receive callback
Hello Sebastian, Sounds like you are using the openib btl as a starting point, which is a good place to start. I am curious whether you are indeed using a new interconnect (new hardware and protocol), or whether it is requirements of the 3D-torus network that are not addressed by the openib btl that are driving the need for a new btl?

-DON

On 07/21/09 11:55, Sebastian Rinke wrote:
Hello, I am developing a new BTL component (Open MPI v1.3.2) for a new 3D-torus interconnect. During a simple message transfer of 16362 B between two nodes with MPI_Send(), MPI_Recv() I encounter the following:

The sender:
---
1. prepare_src() size: 16304 reserve: 32 -> alloc() size: 16336 -> ompi_convertor_pack(): 16304
2. send()
3. component_progress() -> send cb() -> free()
4. component_progress() -> recv cb() -> prepare_src() size: 58 reserve: 32 -> alloc() size: 90 -> ompi_convertor_pack(): 58 -> free() size: 90 — send is missing!
5. NO PROGRESS

The receiver:
-
1. component_progress() -> recv cb() -> alloc() size: 32 -> send()
2. component_progress() -> send cb() -> free() size: 32
3. component_progress() for ever!

The problem is that after prepare_src() for the 2nd fragment, the sender calls free() instead of send() in its recv cb. Thus, the 2nd fragment is not being transmitted. As a consequence, the receiver waits for the 2nd fragment. I have found that mca_pml_ob1_recv_frag_callback_ack() is the corresponding recv cb. Before diving into the ob1 code, could you tell me under which conditions this cb calls free() instead of send(), so that I can get an idea of where to look for errors in my BTL component. Thank you very much in advance.

Sebastian Rinke
Re: [OMPI devel] Trunk Heads Up
Sorry. I missed my lashing while on the phone. Thanks George, thanks Jeff.

George Bosilca wrote:
r19218 fixes this problem. I couldn't wait, so I fixed it myself.

george.

On Aug 7, 2008, at 7:38 PM, Jeff Squyres wrote:
There's a missing $2 in the configure.m4. Don actually did ask for a review from Brian and me, and I gave a quick one. My bad for missing it. I'm testing to ensure the fix is right, and then I'll commit.

On Aug 7, 2008, at 1:05 PM, George Bosilca wrote:
Well, the commit itself doesn't modify the build process, as you just added a new component. However, if people autogen, your component doesn't correctly disable itself when not on Solaris. As a result, the build fails on Mac OS X. Here is the error I get at build time:

ranlib: file: .libs/libmca_memchecker.a(memchecker_base_wrappers.o) has no symbols
../../../../../ompi/opal/mca/memory/malloc_solaris/memory_malloc_solaris_component.c:94: error: conflicting types for 'munmap'
/usr/include/sys/mman.h:212: error: previous declaration of 'munmap' was here
../../../../../ompi/opal/mca/memory/malloc_solaris/memory_malloc_solaris_component.c:118:6: error: #error "Can not determine how to call munmap"

And here is a snippet from the config.log:

configure:78271: checking for Solaris
configure:78988: result: no
configure:79050: checking if MCA component memory:malloc_solaris can compile
configure:79052: result: yes

george.

On Aug 7, 2008, at 6:07 PM, Jeff Squyres wrote:
Eh. Damage is done. Leave it in. We'll whip you later. ;-)

On Aug 7, 2008, at 12:04 PM, Don Kerr wrote:
All, I just did a commit (-r19217) which I believe will require an autogen. Since I was reminded that this is not good citizen behavior for the middle of the day, I will now start figuring out how to back this out unless someone beats me to it.

-DON (with head hung low)

--
Jeff Squyres
Cisco Systems
[OMPI devel] Trunk Heads Up
All, I just did a commit (-r19217) which I believe will require an autogen. Since I was reminded that this is not good citizen behavior for the middle of the day, I will now start figuring out how to back this out unless someone beats me to it.

-DON (with head hung low)
Re: [OMPI devel] IBCM error
Jeff Squyres wrote:
On Jul 15, 2008, at 7:30 AM, Ralph Castain wrote:
Minor clarification: we did not test RDMACM on RoadRunner.

Just for further clarification - I did, and it wasn't a particularly good experience. Encountered several problems, none of them overwhelming, hence my comments.

Ah -- I didn't know this. What went wrong? We need to fix it if there are problems.

RDMACM, on the other hand, is *necessary* for iWARP connections. We know it won't scale well because of ARP issues, to which the iWARP vendors are publishing their own solutions (pre-populating ARP caches, etc.). Even when built and installed, RDMACM will not be used by default for IB hardware (you have to specifically ask for it). Since it's necessary for iWARP, I think we need to build and install it by default. Most importantly: production IB users won't be disturbed.

If it is necessary for iWARP, then fine - so long as it is only used if specifically requested. However, I would also ask that we be able to -not- build it upon request, so we can be certain a user doesn't attempt to use it by mistake ("gee, that looks interesting - let Mikey try it!"). Ditto for ibcm support.

Pasha added configure switches for this about a week ago:
--en|disable-openib-ibcm
--en|disable-openib-rdmacm

I like these flags, but I thought there was going to be a run-time check for cases where Open MPI is built on a system that has ibcm support but is later run on a system without ibcm support.

-DON
Re: [OMPI devel] PLM consistency: launch agent param
For something as fundamental as launch, do we still need to specify the component? Could it just be "launch_agent"?

Jeff Squyres wrote:
Sounds good to me. We've done similar things in other frameworks -- put in MCA base params for things that all components could use. How about plm_base_launch_agent?

On Jul 11, 2008, at 10:17 AM, Ralph H Castain wrote:
Since the question of backward compatibility of params came up... ;-)

I've been perusing the various PLM modules to check consistency. One thing I noted right away is that -every- PLM module registers an MCA param to let the user specify an orted cmd. I believe this specifically was done so people could insert their favorite debugger in front of the "orted" on the spawned command line - e.g., "valgrind orted". The problem is that this forces the user to have to figure out the name of the PLM module being used, as the param is called "-mca plm_rsh_agent", or "-mca plm_lsf_orted", or... you name it. For users that only ever operate in one environment, who cares. However, many users (at least around here) operate in multiple environments, and this creates confusion.

I propose to create a single MCA param name for this value - something like "-mca plm_launch_agent" or whatever - and get rid of all these individual registrations to reduce the user confusion.

Comments? I'll put my helmet on...

Ralph
[OMPI devel] Open IB BTL and iWARP
Last I looked, the openib BTL relied on the short eager RDMA buffers being written in order. Is this still the case? If so, how is this handled when iWARP is underneath the user verbs API rather than Mellanox IB HCAs?
Re: [OMPI devel] open ib dependency question
Capturing it in the bug is good enough for me at this point. Thanks Jeff.

Jeff Squyres wrote:
Ok: https://svn.open-mpi.org/trac/ompi/ticket/1375

I think any of us could do this -- it's pretty straightforward. No guarantees on when I can get to it; my 1.3 list is already pretty long...

On Jul 3, 2008, at 6:20 AM, Pavel Shamis (Pasha) wrote:
Jeff Squyres wrote:
Do you need configury to disable building ibcm / rdmacm support? The more I think about it, the more I think that these would be good features to have for v1.3...

I had a similar issue recently. It would be nice to have an option to disable/enable *CM via config flags.

On Jul 3, 2008, at 2:52 AM, Don Kerr wrote:
I did not think it was required, but it hung me up when I built ompi on one system which had the ibcm libraries and then ran on a system without the ibcm libs. I had another issue on the system without ibcm libs which prevented my building there, but I will go down that path again. Thanks.

Jeff Squyres wrote:
That is the IBCM library for the IBCM CPC -- IB connection manager stuff. It's not *necessary*; you could use the OOB CPC if you want to. That being said, OMPI currently builds support for it (and links it in) if it finds the right headers and library files. We don't currently have configury to disable this behavior (and *not* build RDMACM and/or IBCM support). Do you have a problem / need to disable building support for IBCM?

On Jul 2, 2008, at 12:02 PM, Don Kerr wrote:
It appears that mca_btl_openib.so has a dependency on libibcm.so. Is this necessary?
Re: [OMPI devel] open ib dependency question
I did not think it was required, but it hung me up when I built ompi on one system which had the ibcm libraries and then ran on a system without the ibcm libs. I had another issue on the system without ibcm libs which prevented my building there, but I will go down that path again. Thanks.

Jeff Squyres wrote:
That is the IBCM library for the IBCM CPC -- IB connection manager stuff. It's not *necessary*; you could use the OOB CPC if you want to. That being said, OMPI currently builds support for it (and links it in) if it finds the right headers and library files. We don't currently have configury to disable this behavior (and *not* build RDMACM and/or IBCM support). Do you have a problem / need to disable building support for IBCM?

On Jul 2, 2008, at 12:02 PM, Don Kerr wrote:
It appears that mca_btl_openib.so has a dependency on libibcm.so. Is this necessary?
[OMPI devel] open ib dependency question
It appears that the mca_btl_openib.so has a dependency on libibcm.so. Is this necessary?
[OMPI devel] Open MPI Linux Expectations
Can anyone set my expectations with their real-world experiences regarding building Open MPI on one release of Linux and running on another? If I were to...

- Build OMPI on Red Hat 4, will it run on later releases of Red Hat, e.g. Red Hat 5?
- Build OMPI on SUSE 9, will it run on later releases of SUSE, e.g. SUSE 10?
- Build OMPI on Red Hat, will it run on SUSE?
- Build OMPI on SUSE, will it run on Red Hat?

Thanks in advance for your insights.

-DON
Re: [OMPI devel] openib btl build question
Thanks Jeff. Thanks Brian. I ran into this because I was specifically trying to configure with "--disable-progress-threads --disable-mpi-threads", at which point I figured I might as well turn off all threads, so I added "--without-threads" as well. But I can't live without mpi_leave_pinned, so threads are back.

Jeff Squyres wrote:
On May 21, 2008, at 4:37 PM, Brian W. Barrett wrote:
ptmalloc2 is not *required* by the openib btl. But it is required on Linux if you want to use the mpi_leave_pinned functionality.

I see one function call to __pthread_initialize in the ptmalloc2 code -- it *looks* like it's a function of glibc, but I don't know for sure.

There's actually more than that, it's just buried a bit. There's a whole bunch of thread-specific data stuff, which is wrapped so that different thread packages can be used (although OMPI only supports pthreads). The wrappers are in ptmalloc2/sysdeps/pthreads.

Doh! I didn't "grep -r"; my bad...
[OMPI devel] openib btl build question
Just want to make sure what I think I see is true for a Linux build: the openib btl requires ptmalloc2, and ptmalloc2 requires POSIX threads. Is that correct?
[OMPI devel] btl_openib_iwarp.c : making platform specific calls
I believe btl_openib_iwarp.c is making platform-specific calls. I don't have jdmason's email address and wanted to send this note out before today's con call.

btl_openib_iwarp.c: #include <ifaddrs.h>, getifaddrs()
Re: [OMPI devel] 32 bit udapl warnings
This was brought to my attention once before, but I don't see this message, so I just plain forgot about it. :-( uDAPL defines its pointers as uint64 ("typedef DAT_UINT64 DAT_VADDR") and pval is a "void *", which is why the message comes up. If I remove the cast I believe I get a different warning, and I just haven't stopped to think of a way around this.

Tim Prins wrote:
Hi, I am seeing some warnings on the trunk when compiling udapl in 32-bit mode with OFED 1.2.5.1:

btl_udapl.c: In function 'udapl_reg_mr':
btl_udapl.c:95: warning: cast from pointer to integer of different size
btl_udapl.c: In function 'mca_btl_udapl_alloc':
btl_udapl.c:852: warning: cast from pointer to integer of different size
btl_udapl.c: In function 'mca_btl_udapl_prepare_src':
btl_udapl.c:959: warning: cast from pointer to integer of different size
btl_udapl.c:1008: warning: cast from pointer to integer of different size
btl_udapl_component.c: In function 'mca_btl_udapl_component_progress':
btl_udapl_component.c:871: warning: cast from pointer to integer of different size
btl_udapl_endpoint.c: In function 'mca_btl_udapl_endpoint_write_eager':
btl_udapl_endpoint.c:130: warning: cast from pointer to integer of different size
btl_udapl_endpoint.c: In function 'mca_btl_udapl_endpoint_finish_max':
btl_udapl_endpoint.c:775: warning: cast from pointer to integer of different size
btl_udapl_endpoint.c: In function 'mca_btl_udapl_endpoint_post_recv':
btl_udapl_endpoint.c:864: warning: cast from pointer to integer of different size
btl_udapl_endpoint.c: In function 'mca_btl_udapl_endpoint_initialize_control_message':
btl_udapl_endpoint.c:1012: warning: cast from pointer to integer of different size

Thanks, Tim
Re: [OMPI devel] open ib btl and xrc
Those pointers were perfect, thanks. It is easy to see the benefit of fewer QPs (per node instead of per peer) and lower resource consumption, but I am curious about the actual percentage of memory-footprint decrease. I am thinking that the largest portion of the footprint comes from the fragments. Do you have any numbers showing the actual memory footprint savings when using xrc? Just to be clear, I am not asking for you or anyone else to generate these numbers, but if you had them already I would be curious to know the overall savings. -DON Pavel Shamis (Pasha) wrote: Here is a paper from openib http://www.openib.org/archives/nov2007sc/XRC.pdf and here is an mvapich presentation http://mvapich.cse.ohio-state.edu/publications/ofa_nov07-mvapich-xrc.pdf Bottom line: XRC decreases the number of QPs that ompi opens and as a result decreases ompi's memory footprint. In the openib paper you may see more details about XRC. If you need more details about the XRC implementation in the openib btl, please let me know. Don Kerr wrote: Hi, After searching, about the only thing I can find on xrc is what it stands for, can someone explain the benefits of open mpi's use of xrc, maybe point me to a paper, or both? TIA -DON
[OMPI devel] open ib btl and xrc
Hi, After searching, about the only thing I can find on xrc is what it stands for, can someone explain the benefits of open mpi's use of xrc, maybe point me to a paper, or both? TIA -DON
Re: [OMPI devel] Open IB BTL development question
Thanks Steve, Jeff, Pasha, this is the kind of information I was looking for. -DON Pavel Shamis (Pasha) wrote: I plan to add IB APM support (not something specific to OFED) Don Kerr wrote: Looking at the list of new features for OFED 1.3 and seeing that support for XRC went into the trunk, I am curious whether support for additional OFED 1.3 features will be, or is planned to be, included in Open MPI. I am looking at the list of features here: http://64.233.167.104/search?q=cache:RXXOrY36QHcJ:www.openib.org/archives/nov2007sc/OFED%25201.3%2520status.ppt+ofed+1.3+feature=en=clnk=3=us=firefox-a but I do not have any specific feature in mind, just wanted to get an idea what others are planning. Thanks -DON
[OMPI devel] Open IB BTL development question
Looking at the list of new features for OFED 1.3 and seeing that support for XRC went into the trunk, I am curious whether support for additional OFED 1.3 features will be, or is planned to be, included in Open MPI. I am looking at the list of features here: http://64.233.167.104/search?q=cache:RXXOrY36QHcJ:www.openib.org/archives/nov2007sc/OFED%25201.3%2520status.ppt+ofed+1.3+feature=en=clnk=3=us=firefox-a but I do not have any specific feature in mind, just wanted to get an idea what others are planning. Thanks -DON
Re: [OMPI devel] Multi-Rail and Open IB BTL
Jeff Squyres wrote: On Nov 9, 2007, at 1:24 PM, Don Kerr wrote: both, I was thinking of listing what I think are multi-rail requirements but wanted to understand what the current state of things is I believe the OF portion of the FAQ describes what we do in the v1.2 series (right Gleb?); I honestly don't remember what we do today on the trunk (I'm pretty sure that Gleb has tweaked it recently). Gleb's response answered this. As for what we *should* do, it's a very complicated question. :-\ OK. I knew the "close to NIC" was a concern but was not aware an attempt to tackle this had begun. I will look at the "carto" framework. Thanks -DON This is where all these discussions regarding affinity, NUMA, and NUNA (non-uniform network architecture) come into play. A "very simple" scenario may be something like this: - host A is UMA (perhaps even a uniprocessor) with 2 ports that are equidistant from the 1 MPI process on that host - host B is the same, except it only has 1 active port on the same IB subnet as host A's 2 ports - the ports on both hosts are all the same speed (e.g., DDR) - the ports all share a single, common, non-blocking switch But even with this "simple" case, the answer as to what you should do is still unclear. If host A is able to drive both of its DDR links at full speed, you could cause congestion at the link to host B if the MPI process on host A opens two connections. But if host A is only able to drive the same effective bandwidth out of its two ports as it is through a single port, then the end effect is probably fairly negligible -- it might not make much of a difference at all as to whether the MPI process on host A opens 1 or 2 connections to host B. But then throw in other effects that I mentioned above (NUMA, NUNA, etc.), and the equation becomes much more complex. 
In some cases, it may be good to open 1 connection (e.g., bandwidth load balancing); in other cases it may be good to open 2 (e.g., congestion avoidance / spreading traffic around the network, particularly in the presence of other MPI jobs on the network). :-\ Such NUNA architectures may sound unusual to some, but both IBM and HP sell [many] blade-based HPC solutions with NUNA internal IB networks. Specifically: this is a fairly common scenario. So this is a difficult question without a great answer. The hope is that the new carto framework that Sharon sent requirements around for will be able to at least make topology information available from both the host and the network so that BTLs can possibly make some intelligent decisions about what to do in these kinds of scenarios.
Re: [OMPI devel] Multi-Rail and Open IB BTL
both, I was thinking of listing what I think are multi-rail requirements but wanted to understand what the current state of things is Jeff Squyres wrote: Don -- Are you asking what *does* it do, or what *should* a BTL do? On Nov 9, 2007, at 1:09 PM, Don Kerr wrote: Gleb, Another question. What about the case of one node with 2 ports and one node with one port. Does the open ib btl allow the side with 2 ports to establish two endpoints to the single remote port? -DON Gleb Natapov wrote: On Thu, Nov 01, 2007 at 11:15:21AM -0400, Don Kerr wrote: How would the openib btl handle the following scenario: Two nodes, each with two ports, all ports are on the same subnet and switch. Would striping occur over 4 connections or 2? Only two connections will be created. If 2 is it equal distribution or are both local ports connected to the same remote port? Equal distribution. -- Gleb.
Re: [OMPI devel] Multi-Rail and Open IB BTL
Gleb, Another question. What about the case of one node with 2 ports and one node with one port. Does the open ib btl allow the side with 2 ports to establish two endpoints to the single remote port? -DON Gleb Natapov wrote: On Thu, Nov 01, 2007 at 11:15:21AM -0400, Don Kerr wrote: How would the openib btl handle the following scenario: Two nodes, each with two ports, all ports are on the same subnet and switch. Would striping occur over 4 connections or 2? Only two connections will be created. If 2 is it equal distribution or are both local ports connected to the same remote port? Equal distribution. -- Gleb.
Re: [OMPI devel] openib currently broken
Rich, Do the ompi_free_list changes impact the sm btl? Solaris SPARC sm btl seems to have an issue starting with last night's putback but I have not looked into it yet. -DON Richard Graham wrote: R16641 should have fixed the regression. Anyone using ompi_free_list_t_ex() and providing a memory allocator would have been bitten by this, since I did not update this function (which will be deprecated in favor of a version parallel to ompi_free_list_t_new) to initialize the new fields defined. From looking through the btls, this seems to be only the openib btl. Rich On 11/2/07 12:31 PM, "Richard Graham" wrote: On 11/2/07 12:21 PM, "Jeff Squyres" wrote: The freelist changes from yesterday appear to have broken the openib btl. We didn't get lots of test failures in MTT last night only because there was a separate (unrelated) typo in the ofud BTL that prevented the nightly tarball from building on any IB-capable machines. :-) Rich hopes to look into fixing the openib BTL problem today; he thinks it's a case of a simple oversight: the openib BTL is not using the new freelist init functions. Rich: are there other places that are not using the new init functions that need to? the ompi free list has two init functions, I changed just one. The IB btl uses the one I have not yet changed, but the pml uses the one I did change. rich -- Jeff Squyres Cisco Systems
[OMPI devel] Multi-Rail and Open IB BTL
How would the openib btl handle the following scenario: Two nodes, each with two ports, all ports are on the same subnet and switch. Would striping occur over 4 connections or 2? If 2 is it equal distribution or are both local ports connected to the same remote port? Thanks -DON
[OMPI devel] v1.2 branch mpi_preconnect_all
All, I have noticed an issue in the 1.2 branch when mpi_preconnect_all=1. The one-way communication pattern (ranks either send or receive from each other) may not fully establish connections with peers. For example, if I have a 3-process MPI job and rank 0 does not do any MPI communication after MPI_Init(), the other ranks' attempts to connect will not be progressed (I have seen this with tcp and udapl). The preconnect pattern has changed slightly in the trunk but essentially it is still a one-way communication, either send or receive with each rank. So although the issue I see in the 1.2 branch does not appear in the trunk, I wonder if this will show up again. An alternative to the preconnect pattern that comes to mind would be to perform a send and receive between all ranks to ensure that connections have been fully established. Does anyone have thoughts or comments on this, or reasons not to have all ranks send and receive from all? -DON
Re: [OMPI devel] OpenIB BTL and SRQs
Jeff Squyres wrote: On Jul 12, 2007, at 1:18 PM, Don Kerr wrote: - So if you want to simply eliminate the flow control, choose M high enough (or just a total number of receive buffers to post to the SRQ) that you won't ever run out of resources and you should see some speedup from lack of flow control. This obviously mainly helps apps with lots of small messages; it may not help in many other cases. Is there any distinction by the size of the message? If the "M" parameter is set high does the openib btl post this many recv buffers for the SRQ on both QPs? Or are SRQs only created on one of the QPs? Keep in mind that the SRQs are only for send/receive messages, not RDMA messages. That is obvious enough, but isn't there a window for MPI messages that are greater than the eager limit but less than where the RDMA protocol kicks in, where fragments for this size message are larger than the eager size? Maybe this is where openib's high- and low-priority QPs differ from udapl, which makes a choice of which endpoint to use based on the size of the fragment. That is why I was curious whether openib was using SRQs on both queue pairs. Each receive buffer has a max size (the eager limit, IIRC). So if the message is larger than that, we'll fragment per the pipeline protocol, possibly subject to doing RDMA if the message is large enough, yadda yadda yadda. More specifically, the size of the buffer is not dependent upon an individual message that is being sent or received (since they're pre-posted -- we have no idea what the message sizes will be). As for whether the SRQ is on both QP's, this is a Galen/George/Gleb (G^3) question...
Re: [OMPI devel] OpenIB BTL and SRQs
Jeff Squyres wrote: There are a few benefits: - Remember that you post a big pool of buffers instead of num_peers individual sets of receive buffers. Hence, if you post M buffers for each of N peers, each peer -- due to flow control -- can only have M outstanding sends at a time. So if you have apps sending lots of small messages, you can get better utilization of buffer space because a single peer has more than M buffers to receive into. - You can also post less than M*N buffers by playing the statistics of your app -- if you know that you won't have more than M*N messages outstanding at any given time, you can post fewer receive buffers. - At the same time, there's a problem with flow control (meaning that there is none): how can a sender know when they have overflowed the receiver (other than an RNR)? So it's not necessarily as safe. - So if you want to simply eliminate the flow control, choose M high enough (or just a total number of receive buffers to post to the SRQ) that you won't ever run out of resources and you should see some speedup from lack of flow control. This obviously mainly helps apps with lots of small messages; it may not help in many other cases. Is there any distinction by the size of the message. If the "M" parameter is set high does the openib btl post this many recv buffers for the SRQ on both QPs? Or are SRQs only created on one of the QPs? On Jul 12, 2007, at 12:29 PM, Don Kerr wrote: Through mca parameters one can select the use of shared receive queues in the openib btl, other than having fewer queues I am wondering what are the benefits of using this option. Can anyone elaborate on using them vs the default?
Re: [OMPI devel] OpenIB BTL and SRQs
Interesting. So with SRQs there is no flow control; I am guessing the btl sets some reasonable default but essentially is relying on the user to adjust other parameters so the buffers are not overrun. And yes Galen I would like to read your paper. Jeff Squyres wrote: There are a few benefits: - Remember that you post a big pool of buffers instead of num_peers individual sets of receive buffers. Hence, if you post M buffers for each of N peers, each peer -- due to flow control -- can only have M outstanding sends at a time. So if you have apps sending lots of small messages, you can get better utilization of buffer space because a single peer has more than M buffers to receive into. - You can also post less than M*N buffers by playing the statistics of your app -- if you know that you won't have more than M*N messages outstanding at any given time, you can post fewer receive buffers. - At the same time, there's a problem with flow control (meaning that there is none): how can a sender know when they have overflowed the receiver (other than an RNR)? So it's not necessarily as safe. - So if you want to simply eliminate the flow control, choose M high enough (or just a total number of receive buffers to post to the SRQ) that you won't ever run out of resources and you should see some speedup from lack of flow control. This obviously mainly helps apps with lots of small messages; it may not help in many other cases. On Jul 12, 2007, at 12:29 PM, Don Kerr wrote: Through mca parameters one can select the use of shared receive queues in the openib btl, other than having fewer queues I am wondering what are the benefits of using this option. Can anyone elaborate on using them vs the default?
[OMPI devel] OpenIB BTL and SRQs
Through mca parameters one can select the use of shared receive queues in the openib btl; other than having fewer queues, I am wondering what the benefits of using this option are. Can anyone elaborate on using them vs the default?
Re: [OMPI devel] opal_output_verbose usage guidelines
Yes I use opal_show_help in other places but that is an all-or-nothing proposition. I think the ability to be verbose or quiet can be very useful to end users and that is what I need at the moment. -DON Jeff Squyres wrote: On Jul 9, 2007, at 9:58 AM, Don Kerr wrote: You want a warning to show when: 1. the udapl btl is used 2. --enable-debug was not configured 3. the user specifies btl_*_verbose (or btl_*_debug) >= some_value Is that right? If so, is the intent to warn that some checks are not being performed that one would otherwise assume are being performed (because of #3)? #1 and #2 is just to convey the environment I expect the user to be running in, not the error case. Interpretation of #3 is a little askew. uDAPL gets its HCA information from /etc/dat.conf. This file has an entry for each HCA, even those that are potentially not "UP". Also it appears the OFED stack includes by default an entry for "OpenIB-bond" which I have not yet figured out. In any case uDAPL has trouble distinguishing if an HCA is down intentionally or if it is down because something is wrong. So the uDAPL BTL attempts to open all of the entries in this file. You might want to ping the OFA general mailing list or the DAT mailing lists with these kinds of questions...? And the issue becomes how much information to toss back to the user. If a node has two IB interfaces but only one is up, do they want to see a warning message about one of the interfaces being down when they already know this by looking at "ifconfig"? I think not. But this could be valuable information if there is a real problem. True. FWIW, in the openib btl, we only use HCA ports that are active (i.e., have a link signal and have been recognized/allowed on the network by the SM); we silently ignore those that are not active. We do not currently have a diagnostic that shows which ports are ignored because they are not active, IIRC. 
Since it's just one message at this point I think I will go with the base output_id and if I need more I will look to create a component-specific id. Thanks Jeff. FWIW, we always treat the opal_output_verbose output as optional output. If there's something that you definitely want to toss back to the user, use opal_show_help. I expect to pursue this in order to find a better way to distinguish between an interface that is up or down but I don't have a solution at the moment. -DON
[OMPI devel] udapl v1.2.4 merge
Just a heads up. I have merged the uDAPL BTL from the trunk to a tmp repository of v1.2 branch. Can be found in https://svn.open-mpi.org/svn/ompi/tmp/dkerr_udaplv1.2_rdma if anyone is interested in testing before I submit the CMR to bring into 1.2.4. Main goal of CMR: Improve uDAPL BTL performance by adding rdma capabilities to the 1.2 branch. -DON
Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768
It would be difficult for me to attend this afternoon. Tomorrow is much better for me. -DON George Bosilca wrote: I'm available this afternoon. george. On Jun 7, 2007, at 2:35 PM, Galen Shipman wrote: Are people available today to discuss this over the phone? - Galen On Jun 7, 2007, at 11:28 AM, Gleb Natapov wrote: On Thu, Jun 07, 2007 at 11:11:12AM -0400, George Bosilca wrote: I expect you to revise the patch in order to propose a generic solution or I'll trigger a vote against the patch. I vote for it to be backed out of the trunk as it exports way too much knowledge from the Open IB BTL into the PML layer. The patch solves a real problem. If we want to back it out we need to find another solution. I also didn't like this change too much, but I thought about other solutions and haven't found something better than what Galen did. If you have something in mind let's discuss it. As a general comment, this kind of discussion is why I prefer to send significant changes as a patch to the list for discussion before committing. george. PS: With Gleb's changes the problem is the same. The following snippet reflects exactly the same behavior as the original patch. I didn't try to change the semantics. Just made the code match the semantics that Galen described. -- Gleb.