[OMPI devel] Multi-environment builds
Yo all

I have been working on adding/clarifying support for several environments and have encountered a problem that appears to be fairly common out there. Namely, machines that have - over the course of history or for specific reasons - installed libraries to support multiple environments. For example, I can readily find machines that are running TM, but also have LSF and SLURM libraries installed (although those environments are not "active" - the libraries in some cases are old and stale, usually present because either someone wanted to look at them or they represent an old installation).

The problem is that our Open MPI build system automatically detects the presence of those libraries, builds the corresponding components, and then links those libraries into our system. Unfortunately, this causes two side-effects:

1. we wind up building and loading a bunch of components that we cannot use - which impacts memory footprint; and

2. not every component in every framework runs some library function to determine if that environment is actually active. Hence, our selection logic can sometimes get confused due to conflicting priorities, resulting in the selection of components that cause the system to crash.

A couple of solutions come immediately to mind:

1. The most obvious one (to me, at least) is to require that people provide "--with-xx" when they build the system. Instead of automatically detecting an include file and library, and then deciding that the existence of those files dictates that we build support for that environment, we would only build support for those environments that the builder specifies, and error out of the build process if multiple conflicting environments are specified. This raises the issue of what to do with rsh, but I think we can handle that one by simply building it wherever possible.

2. We could laboriously go through all the components and ensure that they check in their selection logic to see if that environment is active. This still causes libraries to be loaded for nothing, but keeps the automatic nature of the build system. We would have to deal with those environments that may not have a "safe" function we can call to see if they are "alive", or have old/stale libraries that may have differing behavior in their APIs, but perhaps those are few enough to not be a big problem.

Any thoughts on this? It seems like we should solve this as it is becoming more prevalent (at least on the machines I test on).

Ralph
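
For discussion, here is a minimal sketch of what option 2 in the message above could look like: each component reports itself as a candidate only when its environment appears to be active, and selection then picks the highest priority among the candidates. This is not Open MPI code. The struct, both functions, and the priorities are hypothetical, and the sketch assumes that the presence of a job-id environment variable (PBS_JOBID, SLURM_JOBID, LSB_JOBID) is a good enough "is it alive" test, which may not hold on every installation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical component descriptor; not the real MCA component struct. */
typedef struct {
    const char *name;        /* component name, e.g. "slurm" */
    const char *active_env;  /* env var assumed to exist only inside a live job;
                                NULL means "always usable" (e.g. rsh) */
    int priority;            /* higher wins among active components */
} env_component_t;

/* Example table. The env var names are set by the respective schedulers,
 * but treating them as the activity test is an assumption of this sketch. */
const env_component_t example_components[] = {
    { "tm",    "PBS_JOBID",   50 },
    { "slurm", "SLURM_JOBID", 50 },
    { "lsf",   "LSB_JOBID",   50 },
    { "rsh",   NULL,          10 },
};

/* A component is a candidate only if its environment looks active. */
static int component_is_active(const env_component_t *c)
{
    assert(c != NULL);
    if (c->active_env == NULL) {
        return 1;
    }
    const char *value = getenv(c->active_env);
    return value != NULL && value[0] != '\0';
}

/* Return the highest-priority active component, or NULL if none is active.
 * Ties go to the earlier entry in the table. */
static const env_component_t *select_component(const env_component_t *comps,
                                               size_t count)
{
    const env_component_t *best = NULL;
    for (size_t i = 0; i < count; i++) {
        if (!component_is_active(&comps[i])) {
            continue;  /* library may be installed, but environment is not live */
        }
        if (best == NULL || comps[i].priority > best->priority) {
            best = &comps[i];
        }
    }
    return best;
}
```

With this shape, a machine that has stale LSF and SLURM libraries but runs under TM would still build all three components, yet only "tm" and "rsh" would survive the activity check, so the memory-footprint concern from side-effect 1 remains while the selection confusion from side-effect 2 goes away.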
Re: [OMPI devel] One-sided operations with Portals
Hi Jeff,

Questions regarding HP's contract with SNL can be directed to Debra Leitka, who is the Sandia Contract Representative (SCR). Debra's contact info is:

Debra Leitka
Phone: 284-8818
Email: dlei...@sandia.gov

The work that I will be doing falls under this contract.

Thanks,
Lisa Glendenning

-----Original Message-----
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres
Sent: Monday, July 09, 2007 7:51 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] One-sided operations with Portals

It is probably worth clarifying to find out for sure (i.e., have the appropriate legal representatives investigate to find out who owns the IP). It is an explicit goal of the Open MPI project to have a traceable code pedigree that is properly licensed.

Thanks.

On Jul 9, 2007, at 9:42 AM, Glendenning, Lisa wrote:

> This work would be done under a contract with Sandia National
> Laboratories. I believe that makes it SNL's IP.
>
> -----Original Message-----
> From: devel-boun...@open-mpi.org [mailto:devel-bounces@open-mpi.org]
> On Behalf Of Jeff Squyres
> Sent: Friday, July 06, 2007 12:03 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] One-sided operations with Portals
>
> On Jul 5, 2007, at 11:16 PM, Glendenning, Lisa wrote:
>
>> Ron Brightwell at SNL has asked me to look into optimizing Open MPI's
>> one-sided operations over Portals. Does anyone have any guidance or
>> thoughts for this?
>
> Does this mean that HP is considering joining the Open MPI project?
> In order to contribute code, a signed copy of the Open MPI 3rd Party
> Contribution agreement must be submitted (see
> http://www.open-mpi.org/community/contribute/).
>
> --
> Jeff Squyres
> Cisco Systems

--
Jeff Squyres
Cisco Systems
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] fake rdma flag again?
Hi all -

I've finally committed a version of the rdma one-sided component that 1) works and 2) in certain situations actually does rdma. I'll make it the default when the BTLs are used as soon as one last bug is fixed in the DDT engine.

However, there is still one outstanding issue. Some BTLs (like Portals or MX) advertise the ability to do a put but place restrictions on the put that only work for OB1. For example, both can only do an RDMA that starts where the prepare_dst() / prepare_src() call said the target buffer was. This isn't a problem for OB1, but kind of defeats the purpose of one-sided ;). There's also a reference count (I believe) in the Portals put/get code that would make life interesting if a descriptor was doing multiple RDMA ops at once.

I was thinking that the easy way to solve this was to add a flag (FAKE_RDMA was the current running favorite, since we've used it before for a different meaning :) ) to the components that have behaviors that work for OB1, but not for a generalized rdma interface. I was wondering what people thought of this idea and if they had any preference for naming the flag.

Brian
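
A rough sketch of the flag idea from the message above, to make the proposal concrete. None of this is actual Open MPI code: the SKETCH_ flag names and values, the struct, and both helper functions are hypothetical, and the real BTL flag definitions may look quite different. The point is only that OB1 keeps accepting any BTL that advertises put, while a one-sided component additionally rejects BTLs that carry the restricted-RDMA marker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits for illustration only. */
enum {
    SKETCH_BTL_FLAG_SEND      = 1 << 0,
    SKETCH_BTL_FLAG_PUT       = 1 << 1,
    SKETCH_BTL_FLAG_GET       = 1 << 2,
    /* put/get works only under OB1-style restrictions, e.g. the transfer
     * must start where prepare_src()/prepare_dst() said the buffer was */
    SKETCH_BTL_FLAG_FAKE_RDMA = 1 << 3
};

/* Hypothetical stand-in for a BTL module: just a name and its flags. */
typedef struct {
    const char *name;
    uint32_t flags;
} sketch_btl_t;

/* OB1 can live with the restrictions, so advertising put is enough. */
static int usable_for_ob1_put(const sketch_btl_t *btl)
{
    assert(btl != NULL);
    return (btl->flags & SKETCH_BTL_FLAG_PUT) != 0;
}

/* A generalized one-sided component needs put or get without restrictions. */
static int usable_for_onesided_rdma(const sketch_btl_t *btl)
{
    assert(btl != NULL);
    const uint32_t rdma = SKETCH_BTL_FLAG_PUT | SKETCH_BTL_FLAG_GET;
    return (btl->flags & rdma) != 0 &&
           (btl->flags & SKETCH_BTL_FLAG_FAKE_RDMA) == 0;
}
```

A BTL with flags SEND | PUT | FAKE_RDMA would pass usable_for_ob1_put() but fail usable_for_onesided_rdma(), so the one-sided component could fall back to send/receive for it. Whatever name is chosen, one extra bit keeps the change local to the affected components.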
Re: [OMPI devel] "New" IB vendor and MTU question
On Jul 9, 2007, at 3:17 PM, Peter Kjellstrom wrote:

> Our new HP cluster has 25208 HCAs (Mellanox Arbel) but a new vendor-id...
> We have 0x1708 (presumably HP, hardware wise Cisco (Mellanox)) to add to the

Added in r15316; thanks for pointing it out.

> existing list in share/openmpi/mca-btl-openib-hca-params.ini that currently
> contains:
>
> # Mellanox 0x2c9
> # Cisco 0x5ad
> # Silverstorm 0x66a
> # Voltaire 0x8f1
>
> Somewhat related question 1: Is there a blessed way to map these ids back to
> strings?

Not via C API, no. But the IEEE OUI web page can be used to look up these values:

http://standards.ieee.org/regauth/oui/

> question 2: Is 1024 really the best MTU for DDR Arbel? I seem to remember this
> being 2048...

I *believe* that that value came from Mellanox, but I don't remember offhand. But it could also be a "doesn't really matter either way" issue. You might want to try both with your apps and see if there's a performance difference. Let us know what happens.

--
Jeff Squyres
Cisco Systems
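
Since there is no C API for mapping the ids back to strings, a site could keep a small lookup table of its own. The sketch below is only an illustration built from the ids quoted in this thread; the type and function names are hypothetical, and the 0x1708 entry is labeled "presumed" because the thread itself only presumes it belongs to HP.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: vendor id as listed in the thread, plus a label. */
typedef struct {
    uint32_t vendor_id;
    const char *name;
} hca_vendor_t;

/* Ids taken from the message above; not an authoritative registry. */
static const hca_vendor_t hca_vendors[] = {
    { 0x2c9,  "Mellanox" },
    { 0x5ad,  "Cisco" },
    { 0x66a,  "Silverstorm" },
    { 0x8f1,  "Voltaire" },
    { 0x1708, "HP (presumed)" },
};

/* Linear search is fine for a handful of entries. */
static const char *hca_vendor_name(uint32_t vendor_id)
{
    const size_t count = sizeof(hca_vendors) / sizeof(hca_vendors[0]);
    for (size_t i = 0; i < count; i++) {
        if (hca_vendors[i].vendor_id == vendor_id) {
            return hca_vendors[i].name;
        }
    }
    return "unknown";
}
```

For anything not in the table, the IEEE OUI listing mentioned above remains the place to look.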
Re: [OMPI devel] Ob1 segfault
On Mon, Jul 09, 2007 at 10:41:52AM -0400, Tim Prins wrote:
> Gleb Natapov wrote:
> > On Sun, Jul 08, 2007 at 12:41:58PM -0400, Tim Prins wrote:
> >> On Sunday 08 July 2007 08:32:27 am Gleb Natapov wrote:
> >>> On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
> >>>> While looking into another problem I ran into an issue which made ob1
> >>>> segfault on me. Using gm, and running the test test_dan1 in the onesided
> >>>> test suite, if I limit the gm freelist by too much, I get a segfault.
> >>>> That is,
> >>>>
> >>>> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 1024 test_dan1
> >>>>
> >>>> works fine, but
> >>>>
> >>>> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 512 test_dan1
> >>> I cannot, unfortunately, reproduce this with openib BTL.
> >>>
> >>>> segfaults. Here is the relevant output from gdb:
> >>>>
> >>>> Program received signal SIGSEGV, Segmentation fault.
> >>>> [Switching to Thread 1077541088 (LWP 15600)]
> >>>> 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
> >>>> hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
> >>>> 267 MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order,
> >>>> sizeof(mca_pml_ob1_fin_hdr_t));
> >>> can you send me what's inside bml_btl?
> >> It turns out that the order of arguments to mca_pml_ob1_send_fin was
> >> wrong. I fixed this in r15304. But now we hang instead of segfault, and
> >> have both processes just looping through opal_progress. I really don't
> >> know what to look for. Any hints?
> >>
> > Can you look in gdb at mca_pml_ob1.rdma_pending?
> Yeah, rank 0 has nothing on the list, and rank 1 has 48 things.

Do you run both ranks on the same node? Can you try to run them on different nodes?

--
Gleb.
Re: [OMPI devel] Ob1 segfault
Gleb Natapov wrote:
> On Sun, Jul 08, 2007 at 12:41:58PM -0400, Tim Prins wrote:
>> On Sunday 08 July 2007 08:32:27 am Gleb Natapov wrote:
>>> On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
>>>> While looking into another problem I ran into an issue which made ob1
>>>> segfault on me. Using gm, and running the test test_dan1 in the onesided
>>>> test suite, if I limit the gm freelist by too much, I get a segfault.
>>>> That is,
>>>>
>>>> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 1024 test_dan1
>>>>
>>>> works fine, but
>>>>
>>>> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 512 test_dan1
>>> I cannot, unfortunately, reproduce this with openib BTL.
>>>
>>>> segfaults. Here is the relevant output from gdb:
>>>>
>>>> Program received signal SIGSEGV, Segmentation fault.
>>>> [Switching to Thread 1077541088 (LWP 15600)]
>>>> 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
>>>> hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
>>>> 267 MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order,
>>>> sizeof(mca_pml_ob1_fin_hdr_t));
>>> can you send me what's inside bml_btl?
>> It turns out that the order of arguments to mca_pml_ob1_send_fin was
>> wrong. I fixed this in r15304. But now we hang instead of segfault, and
>> have both processes just looping through opal_progress. I really don't
>> know what to look for. Any hints?
>>
> Can you look in gdb at mca_pml_ob1.rdma_pending?

Yeah, rank 0 has nothing on the list, and rank 1 has 48 things.
Here is the first item on the list: $7 = { super = { super = { super = { obj_magic_id = 16046253926196952813, obj_class = 0x404f5980, obj_reference_count = 1, cls_init_file_name = 0x404f30f9 "pml_ob1_sendreq.c", cls_init_lineno = 1134 }, opal_list_next = 0x8f5d680, opal_list_prev = 0x404f57c8, opal_list_item_refcount = 1, opal_list_item_belong_to = 0x404f57b0 }, registration = 0x0, ptr = 0x0 }, rdma_bml = 0x8729098, rdma_hdr = { hdr_common = { hdr_type = 8 '\b', hdr_flags = 4 '\004' }, hdr_match = { hdr_common = { hdr_type = 8 '\b', hdr_flags = 4 '\004' }, hdr_ctx = 5, hdr_src = 1, hdr_tag = 142418176, hdr_seq = 0, hdr_padding = "\000" }, hdr_rndv = { hdr_match = { hdr_common = { hdr_type = 8 '\b', hdr_flags = 4 '\004' }, hdr_ctx = 5, hdr_src = 1, hdr_tag = 142418176, hdr_seq = 0, hdr_padding = "\000" }, hdr_msg_length = 236982400, hdr_src_req = { lval = 0, ival = 0, pval = 0x0, sval = { uval = 0, lval = 0 } } }, hdr_rget = { hdr_rndv = { hdr_match = { hdr_common = { hdr_type = 8 '\b', hdr_flags = 4 '\004' }, hdr_ctx = 5, hdr_src = 1, hdr_tag = 142418176, hdr_seq = 0, hdr_padding = "\000" }, hdr_msg_length = 236982400, hdr_src_req = { lval = 0, ival = 0, pval = 0x0, sval = { uval = 0, lval = 0 } } }, hdr_seg_cnt = 1106481152, hdr_padding = "\000\000\000", hdr_des = { lval = 32768, ival = 32768, pval = 0x8000, sval = { uval = 32768, lval = 0 } }, hdr_segs = {{ seg_addr = { lval = 0, ival = 0, pval = 0x0, sval = { uval = 0, lval = 0 } }, seg_len = 0, seg_padding = "\000\000\000", seg_key = { key32 = {0, 0}, key64 = 0, key8 = "\000\000\000\000\000\000\000" } }} }, hdr_frag = { hdr_common = { hdr_type = 8 '\b', hdr_flags = 4 '\004' }, hdr_padding = "\005\000\001\000\000", hdr_frag_offset = 142418176, hdr_src_req = { lval = 236982400, ival = 236982400, pval = 0xe201080, sval = { uval = 236982400, lval = 0 } }, hdr_dst_req = { lval = 0, ival = 0, pval = 0x0, sval = { uval = 0, lval = 0 } } }, hdr_ack = { hdr_common = { hdr_type = 8 '\b', hdr_flags = 4 '\004' }, 
hdr_padding = "\005\000\001\000\000", hdr_src_req = { lval = 142418176, ival = 142418176, pval = 0x87d2100, sval = { uval = 142418176, lval = 0 } }, hdr_dst_req = { lval = 236982400, ival = 236982400, pval = 0xe201080, sval = { uval = 236982400, lval = 0 } }, hdr_send_offset = 0 }, hdr_rdma = { hdr_common = { hdr_type = 8
Re: [OMPI devel] opal_output_verbose usage guidelines
Yes I use opal_show_help in other places but that is an all or nothing proposition. I think the ability to be verbose or quiet can be very useful to end users and that is what I need at the moment.

-DON

Jeff Squyres wrote:
> On Jul 9, 2007, at 9:58 AM, Don Kerr wrote:
>
>>> You want a warning to show when:
>>>
>>> 1. the udapl btl is used
>>> 2. --enable-debug was not configured
>>> 3. the user specifies btl_*_verbose (or btl_*_debug) >= some_value
>>>
>>> Is that right? If so, is the intent to warn that some checks are not
>>> being performed that one would otherwise assume are being performed
>>> (because of #3)?
>>
>> #1 and #2 are just to convey the environment I expect the user to be
>> running in, not the error case. Interpretation of #3 is a little askew.
>> uDAPL gets its HCA information from /etc/dat.conf. This file has an
>> entry for each HCA, even those that are potentially not "UP". Also it
>> appears the OFED stack includes by default an entry for "OpenIB-bond"
>> which I have not figured out what it is yet. In any case uDAPL has
>> trouble distinguishing if an HCA is down intentionally or if it is down
>> because something is wrong. So the uDAPL BTL attempts to open all of
>> the entries in this file.
>
> You might want to ping the OFA general mailing list or the DAT mailing
> lists with these kinds of questions...?
>
>> And the issue becomes how much information to toss back to the user.
>> If a node has two IB interfaces but only one is up, do they want to see
>> a warning message about one of the interfaces being down when they
>> already know this by looking at "ifconfig"? I think not. But this could
>> be valuable information if there is a real problem.
>
> True. FWIW, in the openib btl, we only use HCA ports that are active
> (i.e., have a link signal and have been recognized/allowed on the
> network by the SM); we silently ignore those that are not active. We do
> not currently have a diagnostic that shows which ports are ignored
> because they are not active, IIRC.
>
>> Since it's just one message at this point I think I will go with the
>> base output_id and if I need more I will look to create a component
>> specific id. Thanks Jeff.
>
> FWIW, we always treat the opal_output_verbose output as optional
> output. If there's something that you definitely want to toss back to
> the user, use opal_show_help.
>
>> I expect to pursue this in order to find a better way to distinguish
>> between an interface that is up or down but I don't have a solution at
>> the moment.
>>
>> -DON
Re: [OMPI devel] Ob1 segfault
On Sun, Jul 08, 2007 at 12:41:58PM -0400, Tim Prins wrote:
> On Sunday 08 July 2007 08:32:27 am Gleb Natapov wrote:
> > On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
> > > While looking into another problem I ran into an issue which made ob1
> > > segfault on me. Using gm, and running the test test_dan1 in the onesided
> > > test suite, if I limit the gm freelist by too much, I get a segfault.
> > > That is,
> > >
> > > mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 1024 test_dan1
> > >
> > > works fine, but
> > >
> > > mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 512 test_dan1
> >
> > I cannot, unfortunately, reproduce this with openib BTL.
> >
> > > segfaults. Here is the relevant output from gdb:
> > >
> > > Program received signal SIGSEGV, Segmentation fault.
> > > [Switching to Thread 1077541088 (LWP 15600)]
> > > 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
> > > hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
> > > 267 MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order,
> > > sizeof(mca_pml_ob1_fin_hdr_t));
> >
> > can you send me what's inside bml_btl?
>
> It turns out that the order of arguments to mca_pml_ob1_send_fin was wrong. I
> fixed this in r15304. But now we hang instead of segfault, and have both
> processes just looping through opal_progress. I really don't know what to look
> for. Any hints?
>
Can you look in gdb at mca_pml_ob1.rdma_pending?

--
Gleb.
Re: [OMPI devel] One-sided operations with Portals
This work would be done under a contract with Sandia National Laboratories. I believe that makes it SNL's IP.

-----Original Message-----
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres
Sent: Friday, July 06, 2007 12:03 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] One-sided operations with Portals

On Jul 5, 2007, at 11:16 PM, Glendenning, Lisa wrote:

> Ron Brightwell at SNL has asked me to look into optimizing Open MPI's
> one-sided operations over Portals. Does anyone have any guidance or
> thoughts for this?

Does this mean that HP is considering joining the Open MPI project? In order to contribute code, a signed copy of the Open MPI 3rd Party Contribution agreement must be submitted (see http://www.open-mpi.org/community/contribute/).

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] opal_output_verbose usage guidelines
On Jul 6, 2007, at 5:20 PM, Don Kerr wrote:

> Are there any guidelines about the use of opal_output_verbose?

Not so much.

> - Are there hidden meanings for a given verbose level? e.g. 0 reserved
> for PML, or 50-100 for BTL and so on

Nope. The output was designed to use the values with >= kinds of checking; i.e., the higher the verbose value the user gives, the more output they see. I.e., the values are not used in a "bit flag" sense (i.e., each bit enables/disables a specific set of output).

> - Maybe the base component output_id is ok to use in situation XYZ but
> a component specific output_id should be used in situation ABC? Or
> should never be used for component specific output?

I've typically used the base component output_id whenever possible. I usually started off having an output ID for a specific component, but usually that was for debugging (and therefore having oodles and oodles of output). By the time I was done, I usually had only a few output statements and therefore used the base ID.

I guess my suggestion would be: if you're going to have a LOT of output, then make it a component-specific ID. If it's a "reasonable" amount, then just use the base ID. Definitions of those terms are subjective and intentionally fuzzy. :-)

> Why I ask. I want to report a warning to the user when "--enable-debug"
> is not configured. I also do not want the error to show up all the
> time, only when for example --mca btl_base_debug is set to some value.
> I am thinking I will just use opal_output_verbose but wanted to see if
> there were any guidelines about its use? Or if I should be thinking
> about some other option altogether.

You want a warning to show when:

1. the udapl btl is used
2. --enable-debug was not configured
3. the user specifies btl_*_verbose (or btl_*_debug) >= some_value

Is that right? If so, is the intent to warn that some checks are not being performed that one would otherwise assume are being performed (because of #3)?

--
Jeff Squyres
Cisco Systems
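
To illustrate the ">= kinds of checking" described above, here is a small stand-alone sketch of threshold-style verbosity. It deliberately does not use the real opal_output API: the struct and the sketch_output_verbose() function are hypothetical stand-ins, and the level numbers in the usage note are arbitrary. A message is emitted only when the verbosity the user configured for the stream is at least the level attached to the message, so raising the configured value only ever adds output.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical output stream: the verbosity the user asked for, and where
 * the text goes. This is not the real opal_output stream type. */
typedef struct {
    int verbose_level;
    FILE *stream;
} sketch_output_t;

/* Print the message only if the stream's configured verbosity is >= the
 * message's level. Returns 1 if something was printed, 0 otherwise. */
static int sketch_output_verbose(const sketch_output_t *out, int level,
                                 const char *fmt, ...)
{
    assert(out != NULL && out->stream != NULL && fmt != NULL);
    if (out->verbose_level < level) {
        return 0;  /* user did not ask for this much detail */
    }
    va_list args;
    va_start(args, fmt);
    vfprintf(out->stream, fmt, args);
    va_end(args);
    fputc('\n', out->stream);
    return 1;
}
```

For the warning discussed in this thread, a component could call something like sketch_output_verbose(&out, 10, "built without --enable-debug") and the text would stay hidden unless the user set the verbosity to 10 or higher. Because this kind of output is optional by design, anything the user must always see belongs in a different mechanism.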