Re: [OMPI devel] RFC: Linuxes shipping libibverbs
Brian and I chatted a bit about this off-list, and I think we're in agreement now:

- Do not change the default value or meaning of btl_base_warn_component_unused.
- Major point of confusion: the openib BTL is fairly unusual in that it can (and does) tell the difference between "there are no devices present" and "there are devices, but something went wrong". Other BTLs have network interfaces that can't tell the difference and can *only* call the no_nics function, regardless of whether there are no relevant network interfaces or some error occurred during initialization.
- So a reasonable solution would be an openib-BTL-specific mechanism that doesn't call the no_nics function (which displays the btl_base_warn_component_unused warning) when no verbs-capable devices are found, because mainline Linuxes are starting to ship libibverbs. Specific mechanism TBD; likely to be an openib MCA param.

On May 21, 2008, at 9:56 PM, Jeff Squyres wrote:

On May 21, 2008, at 5:02 PM, Brian W. Barrett wrote:

If this is true (for some reason I thought it wasn't), then I think we'd actually be ok with your proposal, but you're right, you'd need something new in the IB btl. I'm not concerned about the dual rail issue -- if you're smart enough to configure dual rail IB, you're smart enough to figure out OMPI mca params. I'm not sure the same is true for a simple IB setup delivered by a white box vendor that barely works on a good day (and unfortunately, there seems to be evidence that these exist).

I'm not sure I understand what you're saying -- you agree, but what "new" do you think we need in the openib BTL? The MCA params saying which ports you expect to be ACTIVE?

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On May 21, 2008, at 5:02 PM, Brian W. Barrett wrote: If this is true (for some reason I thought it wasn't), then I think we'd actually be ok with your proposal, but you're right, you'd need something new in the IB btl. I'm not concerned about the dual rail issue -- if you're smart enough to configure dual rail IB, you're smart enough to figure out OMPI mca params. I'm not sure the same is true for a simple delivered from the white box vendor IB setup that barely works on a good day (and unfortunately, there seems to be evidence that these exist). I'm not sure I understand what you're saying -- you agree, but what "new" do you think we need in the openib BTL? The MCA params saying which ports you expect to be ACTIVE? -- Jeff Squyres Cisco Systems
[OMPI devel] Does Open MPI class exist?
I would dearly like a week-long class on Open MPI - what it is, does, how to build, parameter tweaking, etc. Does anyone know if such a class exists *anywhere* ?
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On Wed, 21 May 2008, Jeff Squyres wrote: I'm only concerned about the case where there's an IB card, the user expects the IB card to be used, and the IB card isn't used. Can you put in a site wide btl = ^tcp to avoid the problem? If the IB card fails, then you'll get unreachable MPI errors. And how many users are going to figure that one out before complaining loudly? That's what LANL did (probably still does) and it worked great there, but that doesn't mean that others will figure that out (after all, not everyone has an OMPI developer on staff...). If the changes don't silence a warning in that situation, I'm fine with whatever you do. But does ibv_get_device_list return an HCA when the port is down (because the SM failed and the machine rebooted since that time)? Yes. If this is true (for some reason I thought it wasn't), then I think we'd actually be ok with your proposal, but you're right, you'd need something new in the IB btl. I'm not concerned about the dual rail issue -- if you're smart enough to configure dual rail IB, you're smart enough to figure out OMPI mca params. I'm not sure the same is true for a simple delivered from the white box vendor IB setup that barely works on a good day (and unfortunately, there seems to be evidence that these exist). Brian
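The "btl = ^tcp" suggestion in this message relies on the exclusion form of a component list. A minimal sketch of that selection rule follows, as a simplified model rather than Open MPI's actual code; the function name and the component list are assumptions made for illustration.

```python
def select_components(spec, available):
    """Return the components from `available` that `spec` allows.

    A spec starting with '^' is an exclusion list ("everything except
    these"); otherwise it is an inclusion list ("only these").
    Simplified model for illustration only.
    """
    spec = spec.strip()
    if spec.startswith("^"):
        excluded = {name.strip() for name in spec[1:].split(",") if name.strip()}
        return [c for c in available if c not in excluded]
    included = {name.strip() for name in spec.split(",") if name.strip()}
    return [c for c in available if c in included]


# Hypothetical set of BTLs built into an installation.
available = ["self", "sm", "openib", "tcp"]

# With tcp excluded there is nothing to fall back to silently.
allowed = select_components("^tcp", available)
print(allowed)

# If openib then fails to initialize, only the local transports remain,
# so remote peers become unreachable instead of quietly using TCP.
remaining = [c for c in allowed if c != "openib"]
print(remaining)
```

The point of the design is that a failure becomes loud (an unreachable error) rather than slow, at the cost of requiring a site-wide setting that someone has to know about.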
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On May 21, 2008, at 4:29 PM, Brian W. Barrett wrote:

Previously, there has not been such a distinction, so I really have no idea which caused the openib BTL to throw its error (and never really cared, as it was always somebody else's problem at that point).

In the scenarios that I'm talking about, the ibv_devinfo(1) and ibv_devices(1) commands should report that there are no devices (you have OFED or equivalent installed but have no verbs-capable hardware):

    [15:21] queeg:~/mpi % ibv_devinfo
    No IB devices found
    [16:41] queeg:~/mpi % ibv_devices
        device              node GUID
        ------              ---------
    [16:41] queeg:~/mpi %

Since there's no need for an immediate change to the code base -- perhaps you could watch over the next few weeks and, when you see problems of the kind that you're worried about, run ibv_devices and ibv_devinfo. If you see OMPI-reported openfabrics problems with no warnings from libibverbs itself (like I mentioned in my first mail) and ibv_dev* are reporting no devices, then we need to worry about cases where the verbs stack itself doesn't even see the devices (which is a Really Big Error; the OS/driver stack doesn't even see the device).

If ibv_dev* reports that there *are* devices when you see the errors that you're worried about, then OMPI would have gotten past this first case and reported something a bit more specific. That is therefore a different warning than the one I'm proposing to remove [by default].

I'm only concerned about the case where there's an IB card, the user expects the IB card to be used, and the IB card isn't used.

Can you put in a site wide btl = ^tcp to avoid the problem? If the IB card fails, then you'll get unreachable MPI errors.

If the changes don't silence a warning in that situation, I'm fine with whatever you do. But does ibv_get_device_list return an HCA when the port is down (because the SM failed and the machine rebooted since that time)?

Yes.

If not, we still have a (fairly common, unfortunately) error case that we need to report (in my opinion).
Agreed. This scenario is already covered by the checking that the openib BTL performs, and I agree that we should not remove this warning. That being said, note that the current error-checking code in the openib BTL only reports if *no* active ports are found on the host. If there are multiple ports in a host where some are active and some are [erroneously] not active, OMPI does not report this (because some real-world users have dual-port HCAs but are only using 1 port). Two options jump to mind: 1. Add yet another MCA param to say "all my ports should be active; warn/error if you find any non-active ports." 2. Add yet another MCA param where ports that *should* be active are itemized. If OMPI finds that any of them are not active, warn/error. #1 could really be a special case of #2 (e.g., a keyword "all"). Both of these options wouldn't be too difficult to do, but we technically are feature frozen... -- Jeff Squyres Cisco Systems
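The two options above can be sketched as a single check, with "all" treated as the special case of the itemized list. This is only an illustration: the function name, the "device:port" naming, and the state strings are assumptions, not an existing Open MPI parameter or API.

```python
def inactive_expected_ports(expected, port_states):
    """Return the expected ports that are not ACTIVE, sorted by name.

    expected    -- "all", or a comma-separated list of "device:port" names
                   (a hypothetical format chosen for this sketch)
    port_states -- dict mapping "device:port" to a state string
    A port that is expected but not present at all counts as not active.
    """
    if expected.strip().lower() == "all":
        wanted = list(port_states)
    else:
        wanted = [p.strip() for p in expected.split(",") if p.strip()]
    return sorted(p for p in wanted if port_states.get(p) != "ACTIVE")


# Hypothetical dual-port HCA where only port 1 is cabled.
states = {"mthca0:1": "ACTIVE", "mthca0:2": "DOWN"}

# Option 2: only the itemized port is expected, so nothing to report.
print(inactive_expected_ports("mthca0:1", states))

# Option 1 as the "all" keyword: the uncabled port is reported.
print(inactive_expected_ports("all", states))
```

An empty result means no warning; a non-empty one lists exactly which ports to complain about, which keeps the dual-port-but-one-cable users quiet unless they opt in with "all".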
Re: [OMPI devel] openib btl build question
On May 21, 2008, at 4:37 PM, Brian W. Barrett wrote: ptmalloc2 is not *required* by the openib btl. But it is required on Linux if you want to use the mpi_leave_pinned functionality. I see one function call to __pthread_initialize in the ptmalloc2 code -- it *looks* like it's a function of glibc, but I don't know for sure. There's actually more than that, it's just buried a bit. There's a whole bunch of thread-specific data stuff, which is wrapped so that different thread packages can be used (although OMPI only supports pthreads). The wrappers are in ptmalloc2/sysdeps/pthreads. Doh! I didn't "grep -r"; my bad... -- Jeff Squyres Cisco Systems
Re: [OMPI devel] openib btl build question
On Wed, 21 May 2008, Jeff Squyres wrote: On May 21, 2008, at 4:17 PM, Don Kerr wrote: Just want to make sure what I think I see is true: Linux build. openib btl requires ptmalloc2 and ptmalloc2 requires posix threads, is that correct? ptmalloc2 is not *required* by the openib btl. But it is required on Linux if you want to use the mpi_leave_pinned functionality. I see one function call to __pthread_initialize in the ptmalloc2 code -- it *looks* like it's a function of glibc, but I don't know for sure. There's actually more than that, it's just buried a bit. There's a whole bunch of thread-specific data stuff, which is wrapped so that different thread packages can be used (although OMPI only supports pthreads). The wrappers are in ptmalloc2/sysdeps/pthreads. Brian
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On May 21, 2008, at 12:21 PM, Terry Dontje wrote: So are you proposing to set btl_base_warn_component_unused to 0 or something more BTL specific? Probably something more btl-specific. The libibverbs-being-shipped-in- main-line-Linux-distro's issue doesn't really affect other BTLs, so I don't think it's appropriate to make this a global-to-all-of-OMPI issue. But the question of when exactly the openib BTL should call the function that displays that message is up in the air. Brian and I seem to disagree. :-) -- Jeff Squyres Cisco Systems
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On Wed, 21 May 2008, Jeff Squyres wrote:

On May 21, 2008, at 3:38 PM, Jeff Squyres wrote:

It would be great if libibverbs could return two different error messages - one for "there's no IB card in this machine" and one for "there's an IB card here, but we can't initialize it". I think that would make this argument go away. Open MPI could probably mimic that behavior by parsing the PCI tables, but that sounds ... painful.

Thinking about this a bit more -- I think it depends on what kind of errors you are worried about seeing. IBV does separate the discovery of devices (ibv_get_device_list) from trying to open a device (ibv_open_device). So hypothetically, we *can* distinguish between these kinds of errors already. Do you see devices that are so broken that they don't show up in the list returned from ibv_get_device_list?

FWIW: the *only* case I'm talking about changing the default for is when ibv_get_device_list returns an empty list (meaning that according to the verbs stack, there are no devices in the host). I think that we should *always* warn for any kinds of errors that occur after that (e.g., we find a device but can't open it, we find one or more devices but no active ports, etc.).

Previously, there has not been such a distinction, so I really have no idea which caused the openib BTL to throw its error (and never really cared, as it was always somebody else's problem at that point).

I'm only concerned about the case where there's an IB card, the user expects the IB card to be used, and the IB card isn't used. If the changes don't silence a warning in that situation, I'm fine with whatever you do. But does ibv_get_device_list return an HCA when the port is down (because the SM failed and the machine rebooted since that time)? If not, we still have a (fairly common, unfortunately) error case that we need to report (in my opinion).

Brian
Re: [OMPI devel] openib btl build question
On May 21, 2008, at 4:17 PM, Don Kerr wrote: Just want to make sure what I think I see is true: Linux build. openib btl requires ptmalloc2 and ptmalloc2 requires posix threads, is that correct? ptmalloc2 is not *required* by the openib btl. But it is required on Linux if you want to use the mpi_leave_pinned functionality. I see one function call to __pthread_initialize in the ptmalloc2 code -- it *looks* like it's a function of glibc, but I don't know for sure. -- Jeff Squyres Cisco Systems
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On May 21, 2008, at 3:38 PM, Jeff Squyres wrote: It would be great if libibverbs could return two different error messages - one for "there's no IB card in this machine" and one for "there's an IB card here, but we can't initialize it". I think that would make this argument go away. Open MPI could probably mimic that behavior by parsing the PCI tables, but that sounds ... painful. Thinking about this a bit more -- I think it depends on what kind of errors you are worried about seeing. IBV does separate the discovery of devices (ibv_get_device_list) from trying to open a device (ibv_open_device). So hypothetically, we *can* distinguish between these kinds of errors already. Do you see devices that are so broken that they don't show up in the list returned from ibv_get_device_list? FWIW: the *only* case I'm talking about changing the default for is when ibv_get_device_list returns an empty list (meaning that according to the verbs stack, there are no devices in the host). I think that we should *always* warn for any kinds of errors that occur after that (e.g., we find a device but can't open it, we find one or more devices but no active ports, etc.). -- Jeff Squyres Cisco Systems
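The distinction drawn in this message (an empty device list versus any failure after discovery) can be modeled as a small decision function. This is a sketch of the proposed policy only, not the openib BTL's code: the probe is represented by plain dictionaries instead of real libibverbs calls, and the names ProbeResult, classify, should_warn, and warn_no_device are invented for illustration.

```python
from enum import Enum


class ProbeResult(Enum):
    NO_DEVICES = "no devices"            # discovery returned an empty list
    OPEN_FAILED = "open failed"          # devices listed, none could be opened
    NO_ACTIVE_PORTS = "no active ports"  # devices opened, no port is ACTIVE
    OK = "ok"


def classify(devices):
    """Classify a probe.

    `devices` is a list of dicts with keys 'opened' (bool) and
    'port_states' (list of str), standing in for what discovery,
    open, and port queries would report.
    """
    if not devices:
        return ProbeResult.NO_DEVICES
    opened = [d for d in devices if d["opened"]]
    if not opened:
        return ProbeResult.OPEN_FAILED
    if not any(s == "ACTIVE" for d in opened for s in d["port_states"]):
        return ProbeResult.NO_ACTIVE_PORTS
    return ProbeResult.OK


def should_warn(result, warn_no_device=False):
    """Stay quiet by default only when no device exists at all."""
    if result is ProbeResult.OK:
        return False
    if result is ProbeResult.NO_DEVICES:
        return warn_no_device  # hypothetical opt-in parameter
    return True  # every post-discovery failure always warns


# A machine with libibverbs installed but no hardware: silent by default.
print(should_warn(classify([])))

# A machine with an HCA whose port is down: still warns.
print(should_warn(classify([{"opened": True, "port_states": ["DOWN"]}])))
```

Under this model the case Brian is worried about (a card is present but unusable) never reaches the silent branch, as long as the broken device still appears in the discovery list.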
[OMPI devel] openib btl build question
Just want to make sure what I think I see is true: Linux build. openib btl requires ptmalloc2 and ptmalloc2 requires posix threads, is that correct?
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
Then we disagree on a core point. I believe that users should never have something silently unexpected happen (like falling back to TCP from a high speed interconnect because of a NIC reset / software issue). You clearly don't feel this way. I don't really work on the project, but do have lots of experience being yelled at by users when something unexpected happens. I guarantee you we'll see a report of poor IB / application performance because of the silent fallback to TCP. There's a reason that error message was put in.

I don't get a vote anymore, so do whatever you think is best.

Brian

On Wed, 21 May 2008, Jeff Squyres wrote:

One thing I should clarify -- the ibverbs error message from my previous mail is a red herring. libibverbs prints that message on systems where the kernel portions of the OFED stack are not installed (such as the quick-n-dirty test that I did before -- all I did was install libibverbs without the corresponding kernel stuff). I installed the whole OFED stack on a machine with no verbs-capable hardware and verified that the libibverbs message does *not* appear when the kernel bits are properly installed and running. So we're only talking about the Open MPI warning message here. More below.

On May 21, 2008, at 12:17 PM, Brian W. Barrett wrote:

2. An out-of-the-box "mpirun a.out" will print warning messages in perfectly valid/good configurations (no verbs-capable hardware, but just happen to have libibverbs installed). This is a Big Deal.

Which is easily solved with a better error message, as Pasha suggested.

I guess this is where we disagree: I don't believe that the issue is solved by making a "better" message. Specifically: this is the first case where we're saying "if you run with a valid configuration, you're going to get a warning message and you have to do something extra to turn it off." That just seems darn weird to me, especially when other MPIs don't do the same thing.
Come to think of it, I can't think of many other software packages that do that. In short: I think it's no longer safe to assume that machines with libibverbs installed must also have verbs-capable hardware. But here's the real problem -- with our current selection logic, a user with libibverbs but no IB cards gets an error message saying "hey, we need you to set this flag to make this error go away" (or would, per Pasha's suggestion). A user with a busted IB stack on a node (which we still saw pretty often at LANL) starts using TCP and their application runs like a dog. I guess it's a matter of how often you see errors in the IB stack that cause nic initialization to fail. The machines I tend to use still exhibit this problem pretty often, but it's possible I just work on bad hardware more often than is usual in the wild. I guess this is the central issue: what *is* the common case? Which set of users should be forced to do something different? I'm claiming that now that the Linux distros are shipping libibverbs, the number of users who have the openib BTL installed but do not have verbs-capable hardware will be *much* larger than those with verbs- capable hardware. Hence, I think the pain point should be for the smaller group (those with verbs-capable hardware): set an MCA param if you want to see the warning message. (we can debate the default value for the BTL-wide base param later -- let's first just debate the *concept* as specific to the openib BTL) It would be great if libibverbs could return two different error messages - one for "there's no IB card in this machine" and one for "there's an IB card here, but we can't initialize it". I think that would make this argument go away. Open MPI could probably mimic that behavior by parsing the PCI tables, but that sounds ... painful. Yes, this capability in libiverbs would be good. Parsing the PCI tables doesn't sound like our role. I'll ask the libibverbs authors about it... 
I guess the root of my concern is that unexpected behavior with no explanation is (in my mind) the most dangerous case and the one we should address by default. And turning this error message off is going to cause unexpected behavior without explanation.

But more information is available, and subject to normal troubleshooting techniques. And if you're in an environment where you *do* want to use verbs-capable hardware, then setting the MCA param seems perfectly acceptable to me. IIRC, LANL sets a whole pile of MCA params in the top-level openmpi-mca-params.conf file that are specific to their environment (right?). If that's true, what's one more param?

Heck, the OMPI installed by OFED can set an MCA param in openmpi-mca-params.conf by default (OFED's OMPI is what most users with verbs-capable hardware utilize). That would solve the issue for 98% of the IB/iWARP users out there. Those who compile from source would need to do it manually.

I agree that this is less than perfect. My main point is that I really don't like the idea
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
One thing I should clarify -- the ibverbs error message from my previous mail is a red herring. libibverbs prints that message on systems where the kernel portions of the OFED stack are not installed (such as the quick-n-dirty test that I did before -- all I did was install libibverbs without the corresponding kernel stuff). I installed the whole OFED stack on a machine with no verbs-capable hardware and verified that the libibverbs message does *not* appear when the kernel bits are properly installed and running. So we're only talking about the Open MPI warning message here. More below. On May 21, 2008, at 12:17 PM, Brian W. Barrett wrote: 2. An out-of-the-box "mpirun a.out" will print warning messages in perfectly valid/good configurations (no verbs-capable hardware, but just happen to have libibverbs installed). This is a Big Deal. Which is easily solved with a better error message, as Pasha suggested. I guess this is where we disagree: I don't believe that the issue is solved by making a "better" message. Specifically: this is the first case where we're saying "if you run with a valid configuration, you're going to get a warning message and you have to do something extra to turn it off." That just seems darn weird to me, especially when other MPI's don't do the same thing. Come to think of it, I can't think of many other software packages that do that. In short: I think it's no longer safe to assume that machines with libibverbs installed must also have verbs-capable hardware. But here's the real problem -- with our current selection logic, a user with libibverbs but no IB cards gets an error message saying "hey, we need you to set this flag to make this error go away" (or would, per Pasha's suggestion). A user with a busted IB stack on a node (which we still saw pretty often at LANL) starts using TCP and their application runs like a dog. I guess it's a matter of how often you see errors in the IB stack that cause nic initialization to fail. 
The machines I tend to use still exhibit this problem pretty often, but it's possible I just work on bad hardware more often than is usual in the wild. I guess this is the central issue: what *is* the common case? Which set of users should be forced to do something different? I'm claiming that now that the Linux distros are shipping libibverbs, the number of users who have the openib BTL installed but do not have verbs-capable hardware will be *much* larger than those with verbs- capable hardware. Hence, I think the pain point should be for the smaller group (those with verbs-capable hardware): set an MCA param if you want to see the warning message. (we can debate the default value for the BTL-wide base param later -- let's first just debate the *concept* as specific to the openib BTL) It would be great if libibverbs could return two different error messages - one for "there's no IB card in this machine" and one for "there's an IB card here, but we can't initialize it". I think that would make this argument go away. Open MPI could probably mimic that behavior by parsing the PCI tables, but that sounds ... painful. Yes, this capability in libiverbs would be good. Parsing the PCI tables doesn't sound like our role. I'll ask the libibverbs authors about it... I guess the root of my concern is that unexpected behavior with no explanation is (in my mind) the most dangerous case and the one we should address by default. And turning this error message off is going to cause unexpected behavior without explanation. But more information is available, and subject to normal troubleshooting techniques. And if you're in an environment where you *do* want to use verbs-capable hardware, then setting the MCA param seems perfectly acceptable to me. IIRC, LANL sets a whole pile of MCA params in the top-level openmpi-mca-params.conf file that are specific to their environment (right?). If that's true, what's one more param? 
Heck, the OMPI installed by OFED can set an MCA param in openmpi-mca-params.conf by default (OFED's OMPI is what most users with verbs-capable hardware utilize). That would solve the issue for 98% of the IB/iWARP users out there. Those who compile from source would need to do it manually.

I agree that this is less than perfect. My main point is that I really don't like the idea that "mpirun a.out" will result in warning messages for perfectly valid configurations.

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On Wed, May 21, 2008 at 07:41:58PM +0300, Pavel Shamis (Pasha) wrote:
> As I know only Openib kernel drivers is installed by default with
> distribution.
> But the user level - libibverbs and other openib stuff is not installed
> by default. User need go to the package manager and explicitly
> select libibverb. So if user decided to install libibverbs he had
> reasons for it, and I think it will be ok to show this warning.

Debian builds with libibverbs because it is available -- we'd be doing a disservice to those who have the hardware if we didn't. Because we build with it, Open MPI packages depend on it. So if you install Open MPI on Debian (and hence Ubuntu and ), you get libibverbs.

So yes, suppressing the warning would be great as a default. And generally speaking, we prefer not to diverge from upstream, so if you made it a default, we wouldn't have to differ in what we ship. That's how the thread started.

Thanks to all for considering this.

Dirk (who as Debian co-maintainer is rather happy with all your good work :)

-- Three out of two people have difficulties with fractions.
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
As I know only Openib kernel drivers is installed by default with distribution. But the user level - libibverbs and other openib stuff is not installed by default. User need go to the package manager and explicitly select libibverb. So if user decided to install libibverbs he had reasons for it, and I think it will be ok to show this warning. Pasha. Jeff Squyres wrote: On May 21, 2008, at 11:14 AM, Brian W. Barrett wrote: I think having a parameter to turn off the warning is a great idea. So great in fact, that it already exists in the trunk and v1.2 :)! Setting the default value for the btl_base_warn_component_unused flag from 0 to 1 will have the desired effect. Ah, ok. I either didn't know about this flag or forgot about it. :-) I just tested this myself and see that there are actually *two* error messages (on a machine where I installed libibverbs, but with no OpenFabrics hardware, with OMPI 1.2.6): % mpirun -np 1 hello libibverbs: Fatal: couldn't read uverbs ABI version. -- [0,1,0]: OpenIB on host eddie.osl.iu.edu was unable to find any HCAs. Another transport will be used instead, although this may result in lower performance. -- So the MCA param takes care of the OMPI message; I'll contact the libibverbs authors about their message. I'm not sure I agree with setting the default to 0, however. The warning has proven extremely useful for diagnosing that IB (or less often GM or MX) isn't properly configured on a compute node due to some random error. It's trivially easy for any packaging group to have the line btl_base_warn_component_unused = 0 added to $prefix/etc/openmpi-mca-params.conf during the install phase of the package build (indeed, our simple build scripts at LANL used to do this on a regular bases due to our need to tweek the OOB to keep IPoIB happier at scale). I think keeping the Debian guys happy is a good thing. Giving them an easy way to turn off silly warnings is a good thing. 
Removing a known useful warning to help them doesn't seem like a good thing.

I guess that this is what I am torn about. Yes, it's a useful message -- in some cases. But now that libibverbs is shipping in Debian and other Linuxes, the number of machines out there with verbs-capable hardware is far, far smaller than the number of machines without verbs-capable hardware. Specifically:

1. The number of cases where seeing the message by default is *not* useful is now potentially [much] larger than the number of cases where the default message is useful.
2. An out-of-the-box "mpirun a.out" will print warning messages in perfectly valid/good configurations (no verbs-capable hardware, but just happen to have libibverbs installed). This is a Big Deal.
3. Problems with HCA hardware and/or verbs stack are uncommon (nowadays). I'd be ok asking someone to enable a debug flag to get more information on configuration problems or hardware faults.

Shouldn't we be optimizing for the common case? In short: I think it's no longer safe to assume that machines with libibverbs installed must also have verbs-capable hardware.
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
So are you proposing to set btl_base_warn_component_unused to 0 or something more BTL specific? --td Jeff Squyres wrote: What: Change default in openib BTL to not complain if no OpenFabrics devices are found Why: Many linuxes are shipping libibverbs these days, but most users still don't have OpenFabrics hardware Where: btl_openib_component.c When: For v1.3 Timeout: Teleconf, 27 May 2008 Short version = Many major linuxes are shipping libibverbs by default these days. OMPI will therefore build the openib BTL by default, but then complains at run time when there's no OpenFabrics hardware. We should change the default in v1.3 to not complain if there is no OpenFabrics devices found (perhaps have an MCA param to enable the warning if desired). Longer version == I just got a request from the Debian Open MPI package maintainers to include the following in the default openmpi-mca-params.conf for the OMPI v1.2 package: # Disable the use of InfiniBand # btl = ^openib Having this in the openmpi-mca-params.conf gives Debian an easy documentation path for users to shut up these warnings when they build on machines with libibverbs present but no OpenFabrics hardware. I think that this is fine for the v1.2 series (and will file a CMR for it). But for v1.3, I think we should change the default. The vast majority of users will not have OpenFabrics devices, and we should therefore not complain if we can't find any at run-time. We can/should still complain if we find OpenFabrics devices but no active ports (i.e., don't change this behavior). But for optimizing the common case: I think we should (by default) not print a warning if no OpenFabrics devices are found. We can also [easily] have an MCA parameter that *will* display a warning if no OpenFabrics devices are found.
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On Wed, 21 May 2008, Jeff Squyres wrote: 2. An out-of-the-box "mpirun a.out" will print warning messages in perfectly valid/good configurations (no verbs-capable hardware, but just happen to have libibverbs installed). This is a Big Deal. Which is easily solved with a better error message, as Pasha suggested. 3. Problems with HCA hardware and/or verbs stack are uncommon (nowadays). I'd be ok asking someone to enable a debug flag to get more information on configuration problems or hardware faults. Shouldn't we be optimizing for the common case? In short: I think it's no longer safe to assume that machines with libibverbs installed must also have verbs-capable hardware. But here's the real problem -- with our current selection logic, a user with libibverbs but no IB cards gets an error message saying "hey, we need you to set this flag to make this error go away" (or would, per Pasha's suggestion). A user with a busted IB stack on a node (which we still saw pretty often at LANL) starts using TCP and their application runs like a dog. I guess it's a matter of how often you see errors in the IB stack that cause nic initialization to fail. The machines I tend to use still exhibit this problem pretty often, but it's possible I just work on bad hardware more often than is usual in the wild. It would be great if libibverbs could return two different error messages - one for "there's no IB card in this machine" and one for "there's an IB card here, but we can't initialize it". I think that would make this argument go away. Open MPI could probably mimic that behavior by parsing the PCI tables, but that sounds ... painful. I guess the root of my concern is that unexpected behavior with no explanation is (in my mind) the most dangerous case and the one we should address by default. And turning this error message off is going to cause unexpected behavior without explanation. Just my $0.02. Brian
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On May 21, 2008, at 11:14 AM, Brian W. Barrett wrote:

> I think having a parameter to turn off the warning is a great idea. So great in fact, that it already exists in the trunk and v1.2 :)! Setting the default value for the btl_base_warn_component_unused flag from 0 to 1 will have the desired effect.

Ah, ok. I either didn't know about this flag or forgot about it. :-)

I just tested this myself and see that there are actually *two* error messages (on a machine where I installed libibverbs, but with no OpenFabrics hardware, with OMPI 1.2.6):

% mpirun -np 1 hello
libibverbs: Fatal: couldn't read uverbs ABI version.
--
[0,1,0]: OpenIB on host eddie.osl.iu.edu was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--

So the MCA param takes care of the OMPI message; I'll contact the libibverbs authors about their message.

> I'm not sure I agree with setting the default to 0, however. The warning has proven extremely useful for diagnosing that IB (or less often GM or MX) isn't properly configured on a compute node due to some random error. It's trivially easy for any packaging group to have the line
>
> btl_base_warn_component_unused = 0
>
> added to $prefix/etc/openmpi-mca-params.conf during the install phase of the package build (indeed, our simple build scripts at LANL used to do this on a regular basis due to our need to tweak the OOB to keep IPoIB happier at scale).
>
> I think keeping the Debian guys happy is a good thing. Giving them an easy way to turn off silly warnings is a good thing. Removing a known useful warning to help them doesn't seem like a good thing.

I guess that this is what I am torn about. Yes, it's a useful message -- in some cases. But now that libibverbs is shipping in Debian and other Linuxes, the number of machines out there with verbs-capable hardware is far, far smaller than the number of machines without verbs-capable hardware. Specifically:

1. The number of cases where seeing the message by default is *not* useful is now potentially [much] larger than the number of cases where the default message is useful.

2. An out-of-the-box "mpirun a.out" will print warning messages in perfectly valid/good configurations (no verbs-capable hardware, but just happen to have libibverbs installed). This is a Big Deal.

3. Problems with HCA hardware and/or verbs stack are uncommon (nowadays). I'd be ok asking someone to enable a debug flag to get more information on configuration problems or hardware faults.

Shouldn't we be optimizing for the common case?

In short: I think it's no longer safe to assume that machines with libibverbs installed must also have verbs-capable hardware.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
And there's a typo in my first paragraph. The flag currently defaults to 1 (print the warning). It should be switched to 0 to turn off the warning. Sorry for any confusion I might have caused -- I blame the lack of caffeine in the morning.

Brian

On Wed, 21 May 2008, Pavel Shamis (Pasha) wrote:

> I agree with Brian. We could add to the warning message a detailed description of how to disable it.
>
> Pasha
>
> Brian W. Barrett wrote:
>> I think having a parameter to turn off the warning is a great idea. So great in fact, that it already exists in the trunk and v1.2 :)! Setting the default value for the btl_base_warn_component_unused flag from 0 to 1 will have the desired effect.
>>
>> I'm not sure I agree with setting the default to 0, however. The warning has proven extremely useful for diagnosing that IB (or less often GM or MX) isn't properly configured on a compute node due to some random error. It's trivially easy for any packaging group to have the line
>>
>> btl_base_warn_component_unused = 0
>>
>> added to $prefix/etc/openmpi-mca-params.conf during the install phase of the package build (indeed, our simple build scripts at LANL used to do this on a regular basis due to our need to tweak the OOB to keep IPoIB happier at scale).
>>
>> I think keeping the Debian guys happy is a good thing. Giving them an easy way to turn off silly warnings is a good thing. Removing a known useful warning to help them doesn't seem like a good thing.
>>
>> Brian
>>
>> On Wed, 21 May 2008, Jeff Squyres wrote:
>>> What: Change default in openib BTL to not complain if no OpenFabrics devices are found
>>> Why: Many Linuxes are shipping libibverbs these days, but most users still don't have OpenFabrics hardware
>>> Where: btl_openib_component.c
>>> When: For v1.3
>>> Timeout: Teleconf, 27 May 2008
>>>
>>> Short version
>>> =============
>>>
>>> Many major Linuxes are shipping libibverbs by default these days. OMPI will therefore build the openib BTL by default, but then complains at run time when there's no OpenFabrics hardware.
>>>
>>> We should change the default in v1.3 to not complain if no OpenFabrics devices are found (perhaps have an MCA param to enable the warning if desired).
>>>
>>> Longer version
>>> ==============
>>>
>>> I just got a request from the Debian Open MPI package maintainers to include the following in the default openmpi-mca-params.conf for the OMPI v1.2 package:
>>>
>>> # Disable the use of InfiniBand
>>> # btl = ^openib
>>>
>>> Having this in the openmpi-mca-params.conf gives Debian an easy documentation path for users to shut up these warnings when they build on machines with libibverbs present but no OpenFabrics hardware. I think that this is fine for the v1.2 series (and will file a CMR for it). But for v1.3, I think we should change the default.
>>>
>>> The vast majority of users will not have OpenFabrics devices, and we should therefore not complain if we can't find any at run-time. We can/should still complain if we find OpenFabrics devices but no active ports (i.e., don't change this behavior).
>>>
>>> But for optimizing the common case: I think we should (by default) not print a warning if no OpenFabrics devices are found. We can also [easily] have an MCA parameter that *will* display a warning if no OpenFabrics devices are found.

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Trunk check-in policy until the branch for 1.3
Thanks,
Rich

On 5/20/08 10:37 PM, "Brad Benton" wrote:

> 2008/5/20 Richard Graham:
>> Brad,
>> Do you want these for bug fixes too?
>
> I think that it's okay to check in small bug fixes without a ticket. I know this is a somewhat nebulous guideline, but I'm thinking bug fixes of a few lines as being "small". So, unless George has an objection, I'm fine with that.
>
> --Brad
>
>> Rich
>>
>> On 5/20/08 5:53 PM, "Brad Benton" wrote:
>>
>>> All:
>>>
>>> In order to better track changes on the trunk until we branch for 1.3, we (the release managers) would like to ask that all trunk checkins have corresponding tickets associated with them. This will help us to keep better track of the state of the trunk prior to branching. Note, this is just until we branch, which, hopefully, will be in a few days.
>>>
>>> The plan is to branch the trunk for 1.3 this Friday evening (May 23). However, depending on the state of the trunk and the final items to get in before the branch, we might decide to delay the branch until the following Tuesday (May 27). George, Jeff & I will discuss this on Friday afternoon and will send out the final plan for branching (Friday or Tuesday) at that time.
>>>
>>> Thanks,
>>> --Brad
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
I agree with Brian. We could add to the warning message a detailed description of how to disable it.

Pasha

Brian W. Barrett wrote:
> I think having a parameter to turn off the warning is a great idea. So great in fact, that it already exists in the trunk and v1.2 :)! Setting the default value for the btl_base_warn_component_unused flag from 0 to 1 will have the desired effect.
>
> I'm not sure I agree with setting the default to 0, however. The warning has proven extremely useful for diagnosing that IB (or less often GM or MX) isn't properly configured on a compute node due to some random error. It's trivially easy for any packaging group to have the line
>
> btl_base_warn_component_unused = 0
>
> added to $prefix/etc/openmpi-mca-params.conf during the install phase of the package build (indeed, our simple build scripts at LANL used to do this on a regular basis due to our need to tweak the OOB to keep IPoIB happier at scale).
>
> I think keeping the Debian guys happy is a good thing. Giving them an easy way to turn off silly warnings is a good thing. Removing a known useful warning to help them doesn't seem like a good thing.
>
> Brian
>
> On Wed, 21 May 2008, Jeff Squyres wrote:
>> What: Change default in openib BTL to not complain if no OpenFabrics devices are found
>> Why: Many Linuxes are shipping libibverbs these days, but most users still don't have OpenFabrics hardware
>> Where: btl_openib_component.c
>> When: For v1.3
>> Timeout: Teleconf, 27 May 2008
>>
>> Short version
>> =============
>>
>> Many major Linuxes are shipping libibverbs by default these days. OMPI will therefore build the openib BTL by default, but then complains at run time when there's no OpenFabrics hardware.
>>
>> We should change the default in v1.3 to not complain if no OpenFabrics devices are found (perhaps have an MCA param to enable the warning if desired).
>>
>> Longer version
>> ==============
>>
>> I just got a request from the Debian Open MPI package maintainers to include the following in the default openmpi-mca-params.conf for the OMPI v1.2 package:
>>
>> # Disable the use of InfiniBand
>> # btl = ^openib
>>
>> Having this in the openmpi-mca-params.conf gives Debian an easy documentation path for users to shut up these warnings when they build on machines with libibverbs present but no OpenFabrics hardware. I think that this is fine for the v1.2 series (and will file a CMR for it). But for v1.3, I think we should change the default.
>>
>> The vast majority of users will not have OpenFabrics devices, and we should therefore not complain if we can't find any at run-time. We can/should still complain if we find OpenFabrics devices but no active ports (i.e., don't change this behavior).
>>
>> But for optimizing the common case: I think we should (by default) not print a warning if no OpenFabrics devices are found. We can also [easily] have an MCA parameter that *will* display a warning if no OpenFabrics devices are found.
[OMPI devel] Intel ifort / Libtool 2.x problem
Heads up: Tim switched over the trunk tarballs yesterday to use Libtool 2.2.4, Autoconf 2.62, and Automake 1.10.1.

MTT shows that there is a problem with ifort and LT 2.2.4 and Fortran shared libraries on Linux -- LT seems to be dropping the necessary compiler flags to build Fortran shared libraries. I did some more testing and have filed a bug report with our friendly neighborhood Libtool developers. I'll file an OMPI trac ticket when my message arrives in the bug-libtool list web archives.

For the time being, if you want to build with the Intel compilers, you must pass -fPIC in FCFLAGS:

./configure CC=icc CXX=icpc FC=ifort F77=ifort FCFLAGS=-fPIC ...

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
I think having a parameter to turn off the warning is a great idea. So great in fact, that it already exists in the trunk and v1.2 :)! Setting the default value for the btl_base_warn_component_unused flag from 0 to 1 will have the desired effect.

I'm not sure I agree with setting the default to 0, however. The warning has proven extremely useful for diagnosing that IB (or less often GM or MX) isn't properly configured on a compute node due to some random error. It's trivially easy for any packaging group to have the line

btl_base_warn_component_unused = 0

added to $prefix/etc/openmpi-mca-params.conf during the install phase of the package build (indeed, our simple build scripts at LANL used to do this on a regular basis due to our need to tweak the OOB to keep IPoIB happier at scale).

I think keeping the Debian guys happy is a good thing. Giving them an easy way to turn off silly warnings is a good thing. Removing a known useful warning to help them doesn't seem like a good thing.

Brian

On Wed, 21 May 2008, Jeff Squyres wrote:

> What: Change default in openib BTL to not complain if no OpenFabrics devices are found
> Why: Many Linuxes are shipping libibverbs these days, but most users still don't have OpenFabrics hardware
> Where: btl_openib_component.c
> When: For v1.3
> Timeout: Teleconf, 27 May 2008
>
> Short version
> =============
>
> Many major Linuxes are shipping libibverbs by default these days. OMPI will therefore build the openib BTL by default, but then complains at run time when there's no OpenFabrics hardware.
>
> We should change the default in v1.3 to not complain if no OpenFabrics devices are found (perhaps have an MCA param to enable the warning if desired).
>
> Longer version
> ==============
>
> I just got a request from the Debian Open MPI package maintainers to include the following in the default openmpi-mca-params.conf for the OMPI v1.2 package:
>
> # Disable the use of InfiniBand
> # btl = ^openib
>
> Having this in the openmpi-mca-params.conf gives Debian an easy documentation path for users to shut up these warnings when they build on machines with libibverbs present but no OpenFabrics hardware. I think that this is fine for the v1.2 series (and will file a CMR for it). But for v1.3, I think we should change the default.
>
> The vast majority of users will not have OpenFabrics devices, and we should therefore not complain if we can't find any at run-time. We can/should still complain if we find OpenFabrics devices but no active ports (i.e., don't change this behavior).
>
> But for optimizing the common case: I think we should (by default) not print a warning if no OpenFabrics devices are found. We can also [easily] have an MCA parameter that *will* display a warning if no OpenFabrics devices are found.
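
Brian's point in this message is that a packaging step can add the line btl_base_warn_component_unused = 0 to $prefix/etc/openmpi-mca-params.conf at install time. A minimal sketch of such a step follows, assuming the file uses "name = value" lines with "#" comments as shown in the thread. The helper name set_mca_param is hypothetical and is not part of Open MPI or of any packaging tool.

```python
from pathlib import Path


def set_mca_param(conf_path, name, value):
    """Add or update a 'name = value' line in an MCA params file.

    Hypothetical packaging helper, not an Open MPI API.
    """
    path = Path(conf_path)
    lines = path.read_text().splitlines() if path.exists() else []
    new_line = f"{name} = {value}"

    for i, line in enumerate(lines):
        stripped = line.strip()
        # Leave comments and lines without '=' untouched.
        if stripped.startswith("#") or "=" not in stripped:
            continue
        key = stripped.split("=", 1)[0].strip()
        if key == name:
            lines[i] = new_line  # parameter already set: overwrite it
            break
    else:
        lines.append(new_line)   # parameter not present: append it

    path.write_text("\n".join(lines) + "\n")
```

A package build could call set_mca_param(prefix + "/etc/openmpi-mca-params.conf", "btl_base_warn_component_unused", 0) during its install phase. Commented-out entries such as "# btl = ^openib" are deliberately left alone so that the documentation path the Debian maintainers asked for stays intact.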
[OMPI devel] RFC: Linuxes shipping libibverbs
What: Change default in openib BTL to not complain if no OpenFabrics devices are found

Why: Many Linuxes are shipping libibverbs these days, but most users still don't have OpenFabrics hardware

Where: btl_openib_component.c

When: For v1.3

Timeout: Teleconf, 27 May 2008

Short version
=============

Many major Linuxes are shipping libibverbs by default these days. OMPI will therefore build the openib BTL by default, but then complains at run time when there's no OpenFabrics hardware.

We should change the default in v1.3 to not complain if no OpenFabrics devices are found (perhaps have an MCA param to enable the warning if desired).

Longer version
==============

I just got a request from the Debian Open MPI package maintainers to include the following in the default openmpi-mca-params.conf for the OMPI v1.2 package:

# Disable the use of InfiniBand
# btl = ^openib

Having this in the openmpi-mca-params.conf gives Debian an easy documentation path for users to shut up these warnings when they build on machines with libibverbs present but no OpenFabrics hardware. I think that this is fine for the v1.2 series (and will file a CMR for it). But for v1.3, I think we should change the default.

The vast majority of users will not have OpenFabrics devices, and we should therefore not complain if we can't find any at run-time. We can/should still complain if we find OpenFabrics devices but no active ports (i.e., don't change this behavior).

But for optimizing the common case: I think we should (by default) not print a warning if no OpenFabrics devices are found. We can also [easily] have an MCA parameter that *will* display a warning if no OpenFabrics devices are found.

--
Jeff Squyres
Cisco Systems
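
The policy proposed in this RFC comes down to three cases, which can be sketched as a small decision function. This is only an illustration of the proposal as worded above, not code from btl_openib_component.c. The names ProbeResult, should_warn and warn_if_no_devices are hypothetical, and warn_if_no_devices stands in for the opt-in MCA parameter whose real name the RFC leaves open.

```python
from enum import Enum


class ProbeResult(Enum):
    """Possible outcomes of looking for OpenFabrics hardware (illustrative)."""
    NO_DEVICES = "no OpenFabrics devices found"
    NO_ACTIVE_PORTS = "devices found, but no active ports"
    USABLE = "at least one active port"


def should_warn(result, warn_if_no_devices=False):
    """Return True if a run-time warning would be printed under the RFC.

    warn_if_no_devices models the opt-in MCA parameter mentioned in the
    RFC; the name is an assumption made for this sketch.
    """
    if result is ProbeResult.NO_DEVICES:
        # Proposed new default: stay quiet unless the user opts in.
        return warn_if_no_devices
    if result is ProbeResult.NO_ACTIVE_PORTS:
        # Unchanged behavior: hardware is present but unusable.
        return True
    return False
```

With the default argument, a machine that merely has libibverbs installed produces no warning, while a machine with devices but no active ports still does.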
Re: [OMPI devel] Threaded progress for CPCs
One more point that Pasha and I hashed out yesterday in IM...

To avoid the problem of posting a short handshake buffer to already-existing SRQs, we will only do the extra handshake if there are PPRQ's in receive_queues. The handshake will go across the smallest PPRQ, and represent all QPs in receive_queues (even the SRQs).

If there are no PPRQ's in the receive_queues value, we'll just skip the handshake and rely on IB's SRQ RNR retransmitting to fix any race conditions.

One point that needs clarification: whether IBCM and RDMACM *require* posting receive buffers on the new QP's. If so, this scheme will run into trouble because we do not want to post any buffers on SRQs; that gets racy and difficult to synchronize right (especially if multiple remote peers are simultaneously trying to connect to a single SRQ). I'll check this out today or tomorrow.

We'll have to re-visit this when iWARP NICs start supporting SRQ, but if the above assumption is true (no need to post any receive buffers for IBCM and RDMACM), it will be good enough for v1.3.

On May 20, 2008, at 12:37 PM, Jeff Squyres wrote:

> Ok, I think we're mostly converged on a solution. This might not get implemented immediately (got some other pending v1.3 stuff to bug fix, etc.), but it'll happen for v1.3.
>
> - endpoint creation will mpool alloc/register a small buffer for handshake
> - cpc does not need to call _post_recvs(); instead, it can just post the single small buffer on each BSRQ QP (from the small buffer on the endpoint)
> - cpc will call _connected() (in the main thread, not the CPC progress thread) when all BSRQ QPs are connected
> - if _post_recvs() was previously called, do the normal "finish setting up" stuff and declare the endpoint CONNECTED
> - if _post_recvs() was not previously called, then:
>   - call _post_recvs()
>   - send a short CTS message on the 1st BSRQ QP
>   - wait for CTS from peer
>   - when the CTS from the peer has arrived *and* we have sent our CTS, declare endpoint CONNECTED
>
> Doing it this way adds no overhead to OOB/XOOB (who don't need this extra handshake).
>
> I think the code can be factored nicely to make this not too complicated. I'll work on this once I figure out the memory corruption I'm seeing in the receive_queues patch...
>
> Note that this addresses the wireup multi-threading issues -- not iWarp SRQ issues. We'll tackle those separately, and possibly not for the initial v1.3.0 release.
>
> On May 20, 2008, at 6:02 AM, Gleb Natapov wrote:
>
>> On Mon, May 19, 2008 at 01:38:53PM -0400, Jeff Squyres wrote:
>>>>> 5. ...?
>>>>
>>>> What about moving the posting of receive buffers into the main thread? With SRQ it is easy: don't post anything in the CPC thread. The main thread will prepost buffers automatically after the first fragment is received on the endpoint (in btl_openib_handle_incoming()). With PPRQ it's more complicated. What if we prepost dummy buffers (not from the free list) during the IBCM connection stage and run another three-way handshake protocol using those buffers, but from the main thread? We will need to prepost one buffer on the active side and two buffers on the passive side.
>>>
>>> This is probably the most viable alternative -- it would be easiest if we did this for all CPC's, not just for IBCM:
>>>
>>> - for PPRQ: CPCs only post a small number of receive buffers, suitable for another handshake that will run in the upper-level openib BTL
>>> - for SRQ: CPCs don't post anything (because the SRQ already "belongs" to the upper level openib BTL)
>>>
>>> Do we have a BSRQ restriction that there *must* be at least one PPRQ?
>>
>> No. We don't have such a restriction and I wouldn't want to add it.
>>
>>> If so, we could always run the upper-level openib BTL really-post-the-buffers handshake over the smallest buffer size BSRQ RC PPRQ (i.e., have the CPC post a single receive on this QP -- see below), which would make things much easier. If we don't already have this restriction, would we mind adding it? We have one PPRQ in our default receive_queues value, anyway.
>>
>> If there is no PPRQ then we can rely on the RNR/retransmit logic in case there are not enough buffers in the SRQ. We do that anyway in the openib BTL code.
>>
>>> With this rationale, once the CPC says "ok, all BSRQ QP's are connected", then _endpoint.c can run a CTS handshake to post the "real" buffers, where each side does the following:
>>>
>>> - CPC calls _endpoint_connected() to tell the upper level BTL that it is fully connected (the function is invoked in the main thread)
>>> - _endpoint_connected() posts all the "real" buffers to all the BSRQ QP's on the endpoint
>>> - _endpoint_connected() then sends a CTS control message to remote peer via smallest RC PPRQ
>>> - upon receipt of CTS:
>>>   - release the buffer (***)
>>>   - set endpoint state of CONNECTED and let all pending messages flow... (as it happens today)
>>>
>>> So it actually doesn't even have to be a handshake -- it's just an additional CTS sent over the newly-created RC QP. Since it's RC, we don't have to do much -- just wait for the CTS to kn
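
The CTS scheme discussed in this message can be modeled as a small per-endpoint state machine. The sketch below is a toy model under stated assumptions, not Open MPI code: the class and method names (Endpoint, on_all_qps_connected, on_cts_from_peer) are hypothetical, the outbox list stands in for sending on the first BSRQ QP, and a boolean stands in for whether _post_recvs() was already called by the CPC.

```python
from enum import Enum, auto


class State(Enum):
    CONNECTING = auto()
    WAITING_FOR_CTS = auto()
    CONNECTED = auto()


class Endpoint:
    """Toy model of one side of the proposed CTS step (illustrative only)."""

    def __init__(self, recvs_posted_by_cpc):
        self.recvs_posted = recvs_posted_by_cpc
        self.cts_sent = False
        self.cts_received = False
        self.state = State.CONNECTING
        self.outbox = []  # messages "sent" to the peer

    def on_all_qps_connected(self):
        """Models the CPC's call into the main thread once all QPs are up."""
        if self.recvs_posted:
            # CPC already posted the real receives (the OOB/XOOB case):
            # no extra handshake is needed.
            self.state = State.CONNECTED
            return
        self.recvs_posted = True    # stands in for _post_recvs()
        self.outbox.append("CTS")   # short CTS on the first BSRQ QP
        self.cts_sent = True
        self.state = State.WAITING_FOR_CTS
        self._maybe_finish()

    def on_cts_from_peer(self):
        """Models receipt of the peer's CTS, which may arrive early."""
        self.cts_received = True
        self._maybe_finish()

    def _maybe_finish(self):
        # CONNECTED only when our CTS is sent *and* the peer's has arrived.
        if self.cts_sent and self.cts_received:
            self.state = State.CONNECTED
```

The model accepts the two events in either order: a peer CTS that arrives before the local side has posted its receives is remembered, and the endpoint only becomes CONNECTED once its own CTS has also been sent.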
Re: [OMPI devel] get_iwarp_subnet_id in openib btl
Yup. It works. Thanks! With r18470 it works even better!

Jon Mason wrote:
> On Tue, May 20, 2008 at 03:44:41PM -0400, Pak Lui wrote:
>> Hi Jon,
>>
>> This is CentOS 4.6 on Ranger. Sorry I didn't mention it. So what should I do?
>>
>> login3% more /etc/*release*
>> ::
>> /etc/redhat-release
>> ::
>> CentOS release 4.6 (Final)
>> ::
>> /etc/rocks-release
>> ::
>> Rocks release 4.2.1 (Cydonia)
>> login3%
>
> Sorry, looks like I busted you. Please pull the latest bits and verify my fix solves your issue.
>
> Thanks,
> Jon
>
>> Jon Mason wrote:
>>> On Tue, May 20, 2008 at 02:48:49PM -0400, Pak Lui wrote:
>>>> Hi,
>>>>
>>>> I am not familiar with get_iwarp_subnet_id and I am not sure why it is causing trunk to barf. I think I am using ofed 1.2.5. See attached for
>>>
>>> That is in the 1.3 tree, not 1.2. There was a bug in Solaris that was fixed recently that was around this area. Please make sure you are at the latest level.
>>>
>>> Thanks,
>>> Jon
>>>
>>>> config.log.
>>>>
>>>> 10439 libtool: link: pgCC -O -DNDEBUG -o .libs/ompi_info components.o ompi_info.o output.o param.o version.o ../../../ompi/.libs/libmpi.so -L/opt/ofed/lib64 -libcm -libverbs -lrt /work/00951/paklui/ompi-trunk7/config-data1/orte/.libs/libopen-rte.so /work/00951/paklui/ompi-trunk7/config-data1/opal/.libs/libopen-pal.so -lnuma -ldl -lnsl -lutil -lpthread -Wl,--rpath -Wl,/work/00951/paklui/ompi-trunk7/shared-install1/lib
>>>> 10440 ../../../ompi/.libs/libmpi.so: undefined reference to `get_iwarp_subnet_id'
>>>> 10441 make[2]: *** [ompi_info] Error 2
>>>> 10442 make[2]: Leaving directory `/work/00951/paklui/ompi-trunk7/config-data1/ompi/tools/ompi_info'
>>>> 10443 make[1]: *** [install-recursive] Error 1
>>>> 10444 make[1]: Leaving directory `/work/00951/paklui/ompi-trunk7/config-data1/ompi'
>>>> 10445 make: *** [install-recursive] Error 1
>>>> "make.install.log.0" 10445L, 2050037C 10445,1 Bot
>>>>
>>>> --
>>>> - Pak Lui
>>>> pak@sun.com
>>
>> --
>> - Pak Lui
>> pak@sun.com

--
- Pak Lui
pak@sun.com