Re: [OMPI devel] RFC: Linuxes shipping libibverbs
> Is it possible that /sys/class/infiniband directory exist and it is
> empty ? In which cases ?

Do "modprobe ib_core" on a system with no hardware drivers loaded (or no RDMA hardware installed).
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On May 29, 2008, at 3:27 AM, Pavel Shamis (Pasha) wrote:

>> I got some more feedback from Roland off-list explaining that if /sys/class/infiniband does exist and is non-empty and /sys/class/infiniband_verbs/abi_version does not exist, then this is definitely a case where we want to warn because it implies that config is screwed up -- RDMA devices are present but not usable.
>
> Is it possible that /sys/class/infiniband directory exist and it is empty ? In which cases ?

Roland consistently said "...and not empty" in e-mails to me, so that's what I assumed. However, Pasha just did a test: on a machine with a ConnectX HCA, he manually removed the mlx4 driver and started the openibd service. /sys/class/infiniband was created, but it was empty.

I guess this is a situation that we want to warn about -- we can simplify the whole deal by making the overriding assumption: if the drivers are loaded at all (such that /sys/class/infiniband/ exists at all), OMPI should expect to be able to find some RDMA devices. If it doesn't find any, it should issue a warning.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
> I got some more feedback from Roland off-list explaining that if /sys/class/infiniband does exist and is non-empty and /sys/class/infiniband_verbs/abi_version does not exist, then this is definitely a case where we want to warn because it implies that config is screwed up -- RDMA devices are present but not usable.

Is it possible that the /sys/class/infiniband directory exists and is empty? In which cases?
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On May 28, 2008, at 8:02 AM, Jeff Squyres wrote:

> Note that the two /sys checks may be redundant; I'm not entirely sure how the two files relate to each other. libibverbs will complain about the first if it is not present; the second is used to indicate that the kernel drivers are loaded.

I got some more feedback from Roland off-list explaining that if /sys/class/infiniband does exist and is non-empty and /sys/class/infiniband_verbs/abi_version does not exist, then this is definitely a case where we want to warn because it implies that config is screwed up -- RDMA devices are present but not usable.

In this case, I think the warning that libibverbs itself prints is suitable ("Fatal: couldn't read..."). So let's just eliminate that check in OMPI and go with something like the following (pretty much exactly what was proposed a while ago by Pasha :-) ):

    # If sysfs/class/infiniband does not exist, the driver was not
    # started.  Therefore: assume that the user does not want RDMA
    # hardware support -- do *not* print a warning message.
    if (! -d "$sysfsdir/class/infiniband") {
        if ($always_want_to_see_warnings)
            print "Warning: $sysfsdir/class/infiniband does not exist\n";
        return SKIP_THIS_BTL;
    }

    # If we get to this point, the drivers are loaded and therefore we
    # will assume that there is supposed to be at least one RDMA device
    # present.  Warn if we don't find any.
    $list = ibv_get_device_list();
    if (empty($list)) {
        print "Warning: couldn't find any RDMA devices -- if you have no RDMA devices, stop the driver to avoid this warning message\n";
        return SKIP_THIS_BTL;
    }

    # ...continue with initialization; warnings and errors are
    # *always* displayed after this point

--
Jeff Squyres
Cisco Systems
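For readers who want to try the two-step check outside of OMPI, here is a minimal Python sketch of the pseudocode above. It is not the openib BTL code. Two things are assumptions made for illustration: the function name check_rdma_sysfs is hypothetical, and listing the entries of class/infiniband stands in for ibv_get_device_list(), which the real check would call through libibverbs.

```python
import os
import tempfile

SKIP_THIS_BTL = "skip"
CONTINUE_INIT = "continue"


def check_rdma_sysfs(sysfs_dir, always_warn=False):
    """Return (action, warnings) for the two-step check in the pseudocode.

    Assumption: the entries of <sysfs_dir>/class/infiniband are used as a
    stand-in for the list returned by ibv_get_device_list().
    """
    warnings = []
    ib_class = os.path.join(sysfs_dir, "class", "infiniband")

    # Step 1: no directory means the driver was not started, so stay
    # silent unless the user asked to always see warnings.
    if not os.path.isdir(ib_class):
        if always_warn:
            warnings.append(f"Warning: {ib_class} does not exist")
        return SKIP_THIS_BTL, warnings

    # Step 2: the driver is loaded, so at least one device is expected.
    devices = sorted(os.listdir(ib_class))
    if not devices:
        warnings.append(
            "Warning: couldn't find any RDMA devices -- if you have no "
            "RDMA devices, stop the driver to avoid this warning message"
        )
        return SKIP_THIS_BTL, warnings

    return CONTINUE_INIT, warnings


def demo():
    """Walk a throwaway fake sysfs tree through the three possible states."""
    with tempfile.TemporaryDirectory() as root:
        actions = [check_rdma_sysfs(root)[0]]           # directory missing
        ib_class = os.path.join(root, "class", "infiniband")
        os.makedirs(ib_class)
        actions.append(check_rdma_sysfs(root)[0])       # present but empty
        os.mkdir(os.path.join(ib_class, "mlx4_0"))
        actions.append(check_rdma_sysfs(root)[0])       # one device entry
    return actions
```

Passing the sysfs root as a parameter keeps the sketch testable against a temporary directory; on a real system the argument would be /sys (or whatever ibv_get_sysfs_path() reports).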
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
Ok. With lots more off-list discussion, how's this pseudocode for a proposal:

    # Main assumption: if the kernel drivers are loaded, the user wants RDMA
    # hardware support in OMPI.

    $sysfsdir = ibv_get_sysfs_path();

    # Avoid printing "Fatal: couldn't read uverbs ABI version" message.
    if (! -r "$sysfsdir/class/infiniband_verbs/abi_version") {
        if ($always_want_to_see_warnings)
            print "Warning: verbs ABI version unreadable\n";
        return SKIP_THIS_BTL;
    }

    # If sysfs/class/infiniband does not exist, the driver was not started.
    # Therefore: assume that the user does not want RDMA hardware support --
    # do *not* print a warning message.
    if (! -d "$sysfsdir/class/infiniband") {
        if ($always_want_to_see_warnings)
            print "Warning: $sysfsdir/class/infiniband does not exist\n";
        return SKIP_THIS_BTL;
    }

    # If we get to this point, the drivers are loaded and therefore we will
    # assume that there is supposed to be at least one RDMA device present.
    # Warn if we don't find any.
    $list = ibv_get_device_list();
    if (empty($list)) {
        print "Warning: couldn't find any RDMA devices -- if you have no RDMA devices, stop the driver to avoid this warning message\n";
        return SKIP_THIS_BTL;
    }

    # ...continue with initialization; warnings and errors are
    # *always* displayed after this point

An overriding assumption here is that if the user requested *only* the openib BTL in OMPI and it fails to find any devices, OMPI will always print an error that it was unable to reach remote MPI peers (regardless of whether the default warning was previously printed or not).

Note that the two /sys checks may be redundant; I'm not entirely sure how the two files relate to each other. libibverbs will complain about the first if it is not present; the second is used to indicate that the kernel drivers are loaded.

On May 26, 2008, at 5:10 AM, Manuel Prinz wrote:

> On Saturday, 24.05.2008, at 17:30 +0200, Manuel Prinz wrote:
>> On Thursday, 22.05.2008, at 17:18 -0400, Jeff Squyres wrote:
>>> Could you check with some of your other Debian maintainers?
>>
>> I'm sorry that I can't check that before Monday! I'll let you know then but I'm not aware of that.
>
> I just checked on a box with no InfiniBand hardware: /dev/infiniband *does not* exist. Loading the IB kernel modules *does not* create the device. It seems like it only exists if the hardware is present.
>
> Best regards
> Manuel

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On Thursday, 22.05.2008, at 17:18 -0400, Jeff Squyres wrote:
> Could you check with some of your other Debian maintainers?

I'm sorry that I can't check that before Monday! I'll let you know then but I'm not aware of that.

I never thought about the impact of my initial request to Jeff. I personally do not think that it's a problem that warnings are issued; I just think there should be a simple way to help out those users who want to get rid of them and/or help them to understand where the warning comes from in the first place. If you do find a way to suppress them for those who do not have the required hardware, that would indeed be nice, but they should not be suppressed by default if that complicates debugging or system administration. Just my two cents.

The reasoning that libibverbs should only be installed if the hardware exists is somewhat valid but kind of impossible to guarantee in a distribution. We (the Debian OpenMPI maintainers) could build two OMPI packages (with and without the libibverbs dependency) but this increases the maintenance burden and does not work well with other MPI-using packages. I guess the common case is that libibverbs is pulled in from "somewhere" even if no hardware is present.

Best regards
Manuel
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
> Either that or udev in not configured properly.

Debian has a correct udev configuration, modulo http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=449081

> ib_core/mthca/mlx4 should be loaded automatically by hotplug if HW is
> present. No need for any additional configuration.

Yes (although only mlx4_core and not mlx4_ib will be loaded based on PCI IDs), but nothing loads ib_uverbs automatically, and systems that have no RDMA hardware will obviously not have any RDMA drivers autoloaded.

 - R.
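To see which of the modules named above are actually loaded on a given box, one option is to read /proc/modules, where each line starts with the module name. The following is a rough Python sketch under stated assumptions: the helper names loaded_modules and rdma_module_status are made up for illustration, and the SAMPLE text is invented data that mimics the state described above (core drivers loaded, ib_uverbs and mlx4_ib not).

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            # The first whitespace-separated field is the module name.
            names.add(fields[0])
    return names


def rdma_module_status(proc_modules_text,
                       wanted=("ib_core", "ib_uverbs", "mlx4_core", "mlx4_ib")):
    """Map each module of interest to True (loaded) or False (not loaded)."""
    loaded = loaded_modules(proc_modules_text)
    return {name: name in loaded for name in wanted}


# Invented sample (sizes and addresses are placeholders, not real output).
SAMPLE = """\
mlx4_core 123456 0 - Live 0x0000000000000000
ib_core 65432 0 - Live 0x0000000000000000
"""

if __name__ == "__main__":
    for name, is_loaded in rdma_module_status(SAMPLE).items():
        print(f"{name}: {'loaded' if is_loaded else 'not loaded'}")
```

On a live Linux system the text would come from open("/proc/modules").read() instead of SAMPLE.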
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
> OFED is one distribution of the OpenFabrics software. It can be
> bundled up and packaged differently, too. I suspect that Debian does
> not include OFED directly, because OFED is pretty heavily dependent
> upon RPM. So the OpenFabrics kernel bits must be there somewhere
> (libibverbs would be useless, otherwise); it would be nice to
> understand how they are activated: either manually or automatically.

"OpenFabrics kernel bits" doesn't really make sense. Debian just ships a Linux kernel, which has InfiniBand/RDMA drivers. Debian doesn't load the ib_uverbs module by default, nor should it, since the vast majority of users don't have RDMA hardware. So libibverbs and Open MPI should act sanely when no kernel drivers are loaded, /sys/class/infiniband_verbs doesn't exist, etc.

There is already a Debian bug open about this for libibverbs: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=418014

I've been meaning to work on this but sadly I have not been able to put much time into it.

 - R.
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On Fri, May 23, 2008 at 09:56:44AM +0300, Gleb Natapov wrote:
> On Thu, May 22, 2008 at 08:30:52PM +, Dirk Eddelbuettel wrote:
> > > > Also, if this test depends on the Debian kernel packages, then we're
> > > > back to square one as some folks (like myself) run binary kernels,
> > > > other may just hand-compile and this test may not work as we may miss
> > > > the 'Debian trigger' in those cases.
> > >
> > > The OpenFabrics kernel drivers are implemented as kernel modules, so
> > > it's mainly just a question of loading them in to start them running.
> > > For example, in the official OFED distribution, it comes with /etc/
> >
> > Do you have any information whether OFED is in fact packaged for
> > Debian? It may not be, and hence no file ...
> >
> AFAIK OFED is not packaged for debian. Roland packages IB for debian.

Correct, that is my understanding too. Good point also re udev.

Dirk

--
Three out of two people have difficulties with fractions.
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On Thu, May 22, 2008 at 08:30:52PM +, Dirk Eddelbuettel wrote:
> > > Also, if this test depends on the Debian kernel packages, then we're
> > > back to square one as some folks (like myself) run binary kernels,
> > > other may just hand-compile and this test may not work as we may miss
> > > the 'Debian trigger' in those cases.
> >
> > The OpenFabrics kernel drivers are implemented as kernel modules, so
> > it's mainly just a question of loading them in to start them running.
> > For example, in the official OFED distribution, it comes with /etc/
>
> Do you have any information whether OFED is in fact packaged for
> Debian? It may not be, and hence no file ...
>
AFAIK OFED is not packaged for Debian. Roland packages IB for Debian.

--
Gleb.
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On Thu, May 22, 2008 at 04:19:05PM -0400, Jeff Squyres wrote:
> On May 22, 2008, at 4:07 PM, Dirk Eddelbuettel wrote:
>
> > Is there a test I could run for you?
>
> Can you see if /dev/infiniband exists? If it does, the OpenFabrics
> kernel drivers are running. If not, they aren't.

Either that or udev is not configured properly.

> > Also, if this test depends on the Debian kernel packages, then we're
> > back to square one as some folks (like myself) run binary kernels,
> > other may just hand-compile and this test may not work as we may miss
> > the 'Debian trigger' in those cases.
>
> The OpenFabrics kernel drivers are implemented as kernel modules, so
> it's mainly just a question of loading them in to start them running.
> For example, in the official OFED distribution, it comes with
> /etc/init.d/openibd -- "start" loads the kernel modules and does all the
> necessary initialization, "stop" unloads everything, etc.

ib_core/mthca/mlx4 should be loaded automatically by hotplug if HW is present. No need for any additional configuration.

--
Gleb.
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On May 22, 2008, at 4:30 PM, Dirk Eddelbuettel wrote:

>> Can you see if /dev/infiniband exists? If it does, the OpenFabrics kernel drivers are running. If not, they aren't.
>
> Negative -- I have no /dev/infiniband. So his test idea seems feasible which is nice!

Good!

> Do you have any information whether OFED is in fact packaged for Debian? It may not be, and hence no file ...

OFED is one distribution of the OpenFabrics software. It can be bundled up and packaged differently, too. I suspect that Debian does not include OFED directly, because OFED is pretty heavily dependent upon RPM. So the OpenFabrics kernel bits must be there somewhere (libibverbs would be useless, otherwise); it would be nice to understand how they are activated: either manually or automatically.

>> So if you have this init.d file, perhaps it's a question of checking the chkconfig levels upon installation...?
>
> Don't have it, but then again, my personal installation is in no way authoritative for all of Debian.

Could you check with some of your other Debian maintainers?

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On Thu, May 22, 2008 at 04:19:05PM -0400, Jeff Squyres wrote:
> On May 22, 2008, at 4:07 PM, Dirk Eddelbuettel wrote:
>
> > Is there a test I could run for you?
>
> Can you see if /dev/infiniband exists? If it does, the OpenFabrics
> kernel drivers are running. If not, they aren't.

Negative -- I have no /dev/infiniband. So his test idea seems feasible which is nice!

> > Also, if this test depends on the Debian kernel packages, then we're
> > back to square one as some folks (like myself) run binary kernels,
> > other may just hand-compile and this test may not work as we may miss
> > the 'Debian trigger' in those cases.
>
> The OpenFabrics kernel drivers are implemented as kernel modules, so
> it's mainly just a question of loading them in to start them running.
> For example, in the official OFED distribution, it comes with /etc/

Do you have any information whether OFED is in fact packaged for Debian? It may not be, and hence no file ...

> init.d/openibd -- "start" loads the kernel modules and does all the
> necessary initialization, "stop" unloads everything, etc.
>
> So if you have this init.d file, perhaps it's a question of checking
> the chkconfig levels upon installation...?

Don't have it, but then again, my personal installation is in no way authoritative for all of Debian.

Dirk

--
Three out of two people have difficulties with fractions.
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On May 22, 2008, at 4:07 PM, Dirk Eddelbuettel wrote:

> Is there a test I could run for you?

Can you see if /dev/infiniband exists? If it does, the OpenFabrics kernel drivers are running. If not, they aren't.

> Also, if this test depends on the Debian kernel packages, then we're back to square one as some folks (like myself) run binary kernels, other may just hand-compile and this test may not work as we may miss the 'Debian trigger' in those cases.

The OpenFabrics kernel drivers are implemented as kernel modules, so it's mainly just a question of loading them in to start them running. For example, in the official OFED distribution, it comes with /etc/init.d/openibd -- "start" loads the kernel modules and does all the necessary initialization, "stop" unloads everything, etc.

So if you have this init.d file, perhaps it's a question of checking the chkconfig levels upon installation...?

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On Thu, May 22, 2008 at 03:45:36PM -0400, Jeff Squyres wrote:
> On May 22, 2008, at 3:42 PM, Dirk Eddelbuettel wrote:
>
> >> When you install binary OMPI (which pulls in libibverbs and all the
> >> rest), do you set the OpenFabrics kernel drivers to start upon boot?
> >> Or does the user have to do that manually?
> >
> > I think so. To the best of my knowledge, we don't do anything
> > explicitly.
>
> Can you check?

How? Among our approximately 20,000 binary packages, I only see

    edd@ron:~$ apt-cache search fabrics
    libmthca-dev - Development files for the libmthca driver
    libmthca1 - A userspace driver for Mellanox InfiniBand HCAs
    libmthca1-dbg - Debugging symbols for the libmthca driver
    edd@ron:~$

I am not aware of anything kernel-specific happening. That said, I could simply be unaware. Is there a test I could run for you?

Also, if this test depends on the Debian kernel packages, then we're back to square one as some folks (like myself) run binary kernels, others may just hand-compile, and this test may not work as we may miss the 'Debian trigger' in those cases.

> If you (or something else in Debian) start the OpenFabrics drivers
> automatically (regardless of whether there are any verbs-capable
> devices), then that kinda defeats the point of Pasha's proposed check...

Yes, but here I can only offer a meek 'dunno, really'. Sorry!

Dirk

--
Three out of two people have difficulties with fractions.
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On May 22, 2008, at 3:42 PM, Dirk Eddelbuettel wrote:

>> When you install binary OMPI (which pulls in libibverbs and all the rest), do you set the OpenFabrics kernel drivers to start upon boot? Or does the user have to do that manually?
>
> I think so. To the best of my knowledge, we don't do anything explicitly.

Can you check?

If you (or something else in Debian) start the OpenFabrics drivers automatically (regardless of whether there are any verbs-capable devices), then that kinda defeats the point of Pasha's proposed check...

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On Thu, May 22, 2008 at 03:35:03PM -0400, Jeff Squyres wrote:
> Dirk / Debian guys --
>
> When you install binary OMPI (which pulls in libibverbs and all the
> rest), do you set the OpenFabrics kernel drivers to start upon boot?
> Or does the user have to do that manually?

I think so. To the best of my knowledge, we don't do anything explicitly. There is really just a Depends: on whatever is needed to run the code. E.g. because we build against libibverbs, the libopenmpi1 library package ends up with

    Depends: libc6 (>= 2.7-1), libgcc1 (>= 1:4.1.1-21), libgfortran3 (>= 4.3), \
             libibverbs1 (>= 1.1), libstdc++6 (>= 4.1.1-21)

which is rather standard.

> I ask because of the check Pasha proposes: if the user has started the
> OpenFabrics kernel drivers, it's ok for OMPI to print warning messages
> (this is better than the current: if libibverbs exists, it's ok for
> OMPI to print warning messages).

Yes, that sounds fine.

Hth, Dirk

--
Three out of two people have difficulties with fractions.
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
Dirk / Debian guys --

When you install binary OMPI (which pulls in libibverbs and all the rest), do you set the OpenFabrics kernel drivers to start upon boot? Or does the user have to do that manually?

I ask because of the check Pasha proposes: if the user has started the OpenFabrics kernel drivers, it's ok for OMPI to print warning messages (this is better than the current: if libibverbs exists, it's ok for OMPI to print warning messages).

On May 22, 2008, at 3:25 PM, Pavel Shamis (Pasha) wrote:

>>>>> 1. Driver doesn't support the HCA - If I remember correctly, RH40 by default doesn't support the ConnectX HCA. The device_list will be empty. It is a very exotic case.
>>>>> 2. Driver version doesn't correspond with the FW version.
>>>>> 3. FW was broken.
>>>>> 4. Driver was broken and failed to start - this is not a very exotic case either. Sometimes the user makes some modification - upgrade/install/etc. - and it breaks the driver.
>>>>
>>>> In such cases, the ibv_devinfo(1) and ibv_devices(1) commands would show the same error.
>>>
>>> Yep, these utilities will show the same error. Cases 1-2-3 we can cover pretty simply. The OPENIB driver creates "/dev/infiniband" during its startup. So if /dev/infiniband exists and _get_device_list() is empty we may print a warning.
>>
>> Ok, that seems reasonable.
>>
>>> I don't know how we can cover case 4 :-(
>>
>> If the user makes modifications to the driver and breaks it, I don't think we can be held responsible for that -- prudence declares that you should verify that your [self-modified] driver is not broken first before blaming Open MPI. I'm not that concerned about #4; most of my customers do not modify the drivers.
>
> Agree about #4. The check for /dev/infiniband should be simple and I think we can add it to 1.3.
>
> Pasha.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
Brian W. Barrett wrote:
> With MX, it's one initialization call (mx_init), and it's not clear from the errors it can return that you can differentiate between the two cases.

If you run mx_init() on a machine without the MX driver loaded, or with no NIC detected by the driver, you get a specific error code (MX_NO_DEV) and the default error handler prints something like:

    MX:asterix:mx_init:querying driver:error 5(errno=2):No MX device entry in /dev.

You can override the default error handler so that the message is not shown.

Patrick
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On May 22, 2008, at 11:53 AM, Pavel Shamis (Pasha) wrote:

>>> 1. Driver doesn't support the HCA - If I remember correctly, RH40 by default doesn't support the ConnectX HCA. The device_list will be empty. It is a very exotic case.
>>> 2. Driver version doesn't correspond with the FW version.
>>> 3. FW was broken.
>>> 4. Driver was broken and failed to start - this is not a very exotic case either. Sometimes the user makes some modification - upgrade/install/etc. - and it breaks the driver.
>>
>> In such cases, the ibv_devinfo(1) and ibv_devices(1) commands would show the same error.
>
> Yep, these utilities will show the same error. Cases 1-2-3 we can cover pretty simply. The OPENIB driver creates "/dev/infiniband" during its startup. So if /dev/infiniband exists and _get_device_list() is empty we may print a warning.

Ok, that seems reasonable.

> I don't know how we can cover case 4 :-(

If the user makes modifications to the driver and breaks it, I don't think we can be held responsible for that -- prudence declares that you should verify that your [self-modified] driver is not broken first before blaming Open MPI. I'm not that concerned about #4; most of my customers do not modify the drivers.

> BTW I think that the problem is relevant for all BTLs and not only openib, and maybe we need to look for some global solution.

Brian's solution was reasonable; perhaps just adding a flag to the existing no_nics function.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On Thu, May 22, 2008 at 06:53:38PM +0300, Pavel Shamis (Pasha) wrote:
> If user will decide to upgrade his ompi + libibverb rpm/deb package
> install , he will be need to do a lot of other "annoying" steps, like:
> source code download, installing all required *-dev.rpm , compilation.

No -- running "apt-get dist-upgrade", or any of the graphical or console alternatives, will give you _already compiled_ and _already configured_ packages. That's the point of a distribution.

This whole discussion is about how to set up the default configuration (with respect to the IB warning, for hardware which many folks do not have).

Also, step back for a second. Many higher-level MPI-using solutions 'hide' all this plumbing. So someone advocating a Python-based or R-based MPI solution will just tell people 'install the package'. These users won't know the difference between Open MPI, LAM or MPICH and wouldn't even know where to start.

That said, the task is indeed delicate at the Open MPI level, as we do of course want genuine warnings for genuine failures. It is encouraging how hard everybody tries to help 'the right way' about this.

Hth, Dirk

--
Three out of two people have difficulties with fractions.
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On Thu, 22 May 2008, Terry Dontje wrote:

>> The major difference here is that libmyriexpress is not being included in mainline Linux distributions. Specifically: if you can find/use libmyriexpress, it's likely because you have that hardware. The same *used* to be true for libibverbs, but is no longer true because Linux distros are now shipping it (e.g., the Debian distribution pulls in libibverbs when you install Open MPI).
>
> Ok, but there are distributions that do include the myrinet BTL/MTL (ie CT). Though I agree for the most part in the case of myrinet if you have libmyriexpress you will probably have an operable interface. I guess I am curious how many other BTLs a distribution might end up delivering that could run into this reporting issue. I guess my point is could this be worth something more general instead of a one off for IB? From my point of view the btl_warn_unused_components coupled with "-mca btl ^mlfbtl" works for me. However the fact that the IB vendors/community (ie CISCO) is solving this for their favorite interface makes me pause for a moment.

There's actually a second (in my mind more important) reason why this is IB only, as I shared similar concerns (hence yesterday's e-mail barrage). InfiniBand has a two stage initialization -- you get the list of HCAs, then you initialize the HCA you want. So it's possible to determine that there's no HCAs in the system vs. the system couldn't initialize the HCA properly (as that would happen in step 2, according to Jeff). With MX, it's one initialization call (mx_init), and it's not clear from the errors it can return that you can differentiate between the two cases. I haven't tried it, but it's possible that mx_init would succeed in the no nic case, but then have a NIC count of 0.

Anyway, the short answer is that (in my opinion) we should have a btl base param similar to warn_unused for whether to warn when no NICs/HCAs are found, hopefully with a nice error function similar to today's no_nics (which probably needs to be renamed in that case). That way, if BTL authors other than OpenIB want to do some extra work and return better error messages, they can.

>>> FWIW, our distribution actually turns off btl_base_want_component_unused because it seemed that in the majority of our cases users would see false-positive sightings of the message.
>>
>> Is the UDAPL library shipped in Solaris by default? If so, then you're likely in exactly the same kind of situation that I'm describing. The same will be true if Solaris ends up shipping libibverbs by default.
>
> Yes the UDAPL library is shipped in Solaris by default. Which is why we turn off btl_warn_unused_components. Yes, and I suspect once Solaris starts delivering libibverbs we (Sun) will need to figure out how to handle having both the udapl and openib btls being available.

There is some evil configure hackery that could be done to make this work in a more general way (don't you love it when I say that). Autogen/configure makes no guarantees about the order in which the configure.m4 macros for components in the same framework are run, other than all components of priority X are run before those of priority Y, iff X > Y. So you could set the priority of all the components except udapl to (say) 10 and udapl's to 0. Then have the udapl configure only build if 1) it was specifically requested or 2) ompi_check_openib_happy = no. No more Linux-specific stuff, works when Solaris gets OFED, and works on old Solaris that has uDAPL but not OFED.

As a matter of fact, it's so trivial to do that I'd recommend doing it for 1.3. Really, you could do it minimally by only changing OpenIB's configure.params to set its priority to 10, uDAPL's configure.params to set its priority to 0, and uDAPL's configure.m4 to remove the Linux stuff and look for ompi_check_openib_happy.

Brian
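The priority trick can be modelled in a few lines of Python. This is only a toy model of the rule as stated (priority X configures before priority Y iff X > Y; udapl builds only if specifically requested or if openib is not happy), not Open MPI's actual autogen/configure machinery. The function names and the extra "tcp" component are assumptions for illustration, and sorting same-priority components by name is just a way to make the output deterministic, since the real ordering within one priority is unspecified.

```python
def configure_order(priorities):
    """Order components so that higher priorities configure first.

    Ties are broken by name only to keep this sketch deterministic.
    """
    return sorted(priorities, key=lambda name: (-priorities[name], name))


def should_build_udapl(specifically_requested, openib_happy):
    """udapl builds if it was asked for, or if the openib check failed."""
    return specifically_requested or not openib_happy


def build_plan(priorities, udapl_requested, openib_happy):
    """Return the components that would be built, in configure order."""
    built = []
    for name in configure_order(priorities):
        if name == "openib" and not openib_happy:
            continue  # openib configure check failed; nothing to build
        if name == "udapl" and not should_build_udapl(udapl_requested,
                                                      openib_happy):
            continue  # openib already covers this hardware
        built.append(name)
    return built


if __name__ == "__main__":
    prios = {"openib": 10, "tcp": 10, "udapl": 0}
    print(build_plan(prios, udapl_requested=False, openib_happy=True))
    print(build_plan(prios, udapl_requested=False, openib_happy=False))
```

Because udapl has the lowest priority, the outcome of the openib check is always known by the time the udapl decision is made, which is the whole point of the ordering.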
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
Jeff Squyres wrote:
> On May 22, 2008, at 6:50 AM, Terry Dontje wrote:
>
>>> Brian and I chatted a bit about this off-list, and I think we're in agreement now:
>>>
>>> - do not change the default value or meaning of btl_base_want_component_unused.
>>>
>>> - major point of confusion: the openib BTL is actually fairly unique in that it can (and does) tell the difference between "there are no devices present" and "there are devices, but something went wrong". Other BTL's have network interfaces that can't tell the difference and can *only* call the no_nics function, regardless of whether there are no relevant network interfaces or some error occurred during initialization.
>>>
>>> - so a reasonable solution would be an openib-BTL-specific mechanism that doesn't call the no_nics function (to display that btl_base_want_component_unused) if there are no verbs-capable devices found because of the fact that mainline Linuxes are starting to ship libibverbs. Specific mechanism TBD; likely to be an openib MCA param.
>>
>> So, if you are delivering something similar to a BTL for myrinet you will see the message but the belief is this is necessary since there isn't enough granularity in the error reporting of the device to feel comfortable enough as to whether the user wants the device to be used?
>
> The major difference here is that libmyriexpress is not being included in mainline Linux distributions. Specifically: if you can find/use libmyriexpress, it's likely because you have that hardware. The same *used* to be true for libibverbs, but is no longer true because Linux distros are now shipping it (e.g., the Debian distribution pulls in libibverbs when you install Open MPI).

Ok, but there are distributions that do include the myrinet BTL/MTL (ie CT). Though I agree for the most part in the case of myrinet if you have libmyriexpress you will probably have an operable interface. I guess I am curious how many other BTLs a distribution might end up delivering that could run into this reporting issue. I guess my point is could this be worth something more general instead of a one off for IB? From my point of view the btl_warn_unused_components coupled with "-mca btl ^mlfbtl" works for me. However the fact that the IB vendors/community (ie CISCO) is solving this for their favorite interface makes me pause for a moment.

>> Won't udapl have a similar issue here or does it not get built by default when OFED is built?
>
> We decided that under Linux, the udapl BTL does not get built by default (even if it could) because then an "mpirun a.out" by default would use both UDAPL and verbs, which is undesirable for several reasons. There's Linux-specific logic to this effect in config/ompi_check_udapl.m4.

Ok, that makes sense.

>> FWIW, our distribution actually turns off btl_base_want_component_unused because it seemed that in the majority of our cases users would see false-positive sightings of the message.
>
> Is the UDAPL library shipped in Solaris by default? If so, then you're likely in exactly the same kind of situation that I'm describing. The same will be true if Solaris ends up shipping libibverbs by default.

Yes the UDAPL library is shipped in Solaris by default. Which is why we turn off btl_warn_unused_components. Yes, and I suspect once Solaris starts delivering libibverbs we (Sun) will need to figure out how to handle having both the udapl and openib btls being available.

--td
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On May 22, 2008, at 6:50 AM, Terry Dontje wrote:

>> Brian and I chatted a bit about this off-list, and I think we're in agreement now:
>>
>> - do not change the default value or meaning of btl_base_want_component_unused.
>>
>> - major point of confusion: the openib BTL is actually fairly unique in that it can (and does) tell the difference between "there are no devices present" and "there are devices, but something went wrong". Other BTL's have network interfaces that can't tell the difference and can *only* call the no_nics function, regardless of whether there are no relevant network interfaces or some error occurred during initialization.
>>
>> - so a reasonable solution would be an openib-BTL-specific mechanism that doesn't call the no_nics function (to display that btl_base_want_component_unused) if there are no verbs-capable devices found because of the fact that mainline Linuxes are starting to ship libibverbs. Specific mechanism TBD; likely to be an openib MCA param.
>
> So, if you are delivering something similar to a BTL for myrinet you will see the message but the belief is this is necessary since there isn't enough granularity in the error reporting of the device to feel comfortable enough as to whether the user wants the device to be used?

The major difference here is that libmyriexpress is not being included in mainline Linux distributions. Specifically: if you can find/use libmyriexpress, it's likely because you have that hardware. The same *used* to be true for libibverbs, but is no longer true because Linux distros are now shipping it (e.g., the Debian distribution pulls in libibverbs when you install Open MPI).

> Won't udapl have a similar issue here or does it not get built by default when OFED is built?

We decided that under Linux, the udapl BTL does not get built by default (even if it could) because then an "mpirun a.out" by default would use both UDAPL and verbs, which is undesirable for several reasons. There's Linux-specific logic to this effect in config/ompi_check_udapl.m4.

> FWIW, our distribution actually turns off btl_base_want_component_unused because it seemed that in the majority of our cases users would see false-positive sightings of the message.

Is the UDAPL library shipped in Solaris by default? If so, then you're likely in exactly the same kind of situation that I'm describing. The same will be true if Solaris ends up shipping libibverbs by default.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
Jeff Squyres wrote:

> Brian and I chatted a bit about this off-list, and I think we're in agreement now:
>
> - do not change the default value or meaning of btl_base_warn_component_unused.
> - major point of confusion: the openib BTL is actually fairly unique in that it can (and does) tell the difference between "there are no devices present" and "there are devices, but something went wrong". Other BTLs have network interfaces that can't tell the difference and can *only* call the no_nics function, regardless of whether there are no relevant network interfaces or some error occurred during initialization.
> - so a reasonable solution would be an openib-BTL-specific mechanism that doesn't call the no_nics function (to display the btl_base_warn_component_unused warning) if no verbs-capable devices are found, because mainline Linuxes are starting to ship libibverbs. Specific mechanism TBD; likely to be an openib MCA param.

So, if you are delivering something similar, such as a BTL for Myrinet, you will see the message, but the belief is that this is necessary because the device's error reporting isn't granular enough to tell whether the user wants the device to be used?

Won't udapl have a similar issue here or does it not get built by default when OFED is built?

FWIW, our distribution actually turns off btl_base_warn_component_unused because it seemed that in the majority of our cases users would see the message as a false positive.

--td
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
> Brian and I chatted a bit about this off-list, and I think we're in agreement now:
>
> - do not change the default value or meaning of btl_base_warn_component_unused.
> - major point of confusion: the openib BTL is actually fairly unique in that it can (and does) tell the difference between "there are no devices present" and "there are devices, but something went wrong". Other BTLs have network interfaces that can't tell the difference and can *only* call the no_nics function, regardless of whether there are no relevant network interfaces or some error occurred during initialization.
> - so a reasonable solution would be an openib-BTL-specific mechanism that doesn't call the no_nics function (to display the btl_base_warn_component_unused warning) if no verbs-capable devices are found, because mainline Linuxes are starting to ship libibverbs. Specific mechanism TBD; likely to be an openib MCA param.

OK, so we will have our own warning mechanism. But one question is still open: will we show an error message (by default) when libibverbs exists but there is no HCA in the hca_list? I think we should show the error.

The libibverbs default-install problem is relevant only for binary distributions that install all OMPI dependencies along with the OMPI package. In that case, the distribution will have an openib MCA parameter that lets it disable the warning message by default during OMPI package install (or build). I guess that most people still install OMPI from source, and in that case it sounds reasonable to me to print this "no HCA" warning if the openib BTL was built.

Pasha
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
Brian and I chatted a bit about this off-list, and I think we're in agreement now:

- do not change the default value or meaning of btl_base_warn_component_unused.
- major point of confusion: the openib BTL is actually fairly unique in that it can (and does) tell the difference between "there are no devices present" and "there are devices, but something went wrong". Other BTLs have network interfaces that can't tell the difference and can *only* call the no_nics function, regardless of whether there are no relevant network interfaces or some error occurred during initialization.
- so a reasonable solution would be an openib-BTL-specific mechanism that doesn't call the no_nics function (to display the btl_base_warn_component_unused warning) if no verbs-capable devices are found, because mainline Linuxes are starting to ship libibverbs. Specific mechanism TBD; likely to be an openib MCA param.

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On May 21, 2008, at 5:02 PM, Brian W. Barrett wrote:

> If this is true (for some reason I thought it wasn't), then I think we'd actually be ok with your proposal, but you're right, you'd need something new in the IB btl. I'm not concerned about the dual rail issue -- if you're smart enough to configure dual rail IB, you're smart enough to figure out OMPI mca params. I'm not sure the same is true for a simple delivered from the white box vendor IB setup that barely works on a good day (and unfortunately, there seems to be evidence that these exist).

I'm not sure I understand what you're saying -- you agree, but what "new" do you think we need in the openib BTL? The MCA params saying which ports you expect to be ACTIVE?

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On Wed, 21 May 2008, Jeff Squyres wrote:

>> I'm only concerned about the case where there's an IB card, the user expects the IB card to be used, and the IB card isn't used.
>
> Can you put in a site wide btl = ^tcp to avoid the problem? If the IB card fails, then you'll get unreachable MPI errors.

And how many users are going to figure that one out before complaining loudly? That's what LANL did (probably still does) and it worked great there, but that doesn't mean that others will figure that out (after all, not everyone has an OMPI developer on staff...).

>> If the changes don't silence a warning in that situation, I'm fine with whatever you do. But does ibv_get_device_list return an HCA when the port is down (because the SM failed and the machine rebooted since that time)?
>
> Yes.

If this is true (for some reason I thought it wasn't), then I think we'd actually be ok with your proposal, but you're right, you'd need something new in the IB btl. I'm not concerned about the dual rail issue -- if you're smart enough to configure dual rail IB, you're smart enough to figure out OMPI mca params. I'm not sure the same is true for a simple delivered from the white box vendor IB setup that barely works on a good day (and unfortunately, there seems to be evidence that these exist).

Brian
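The "site wide btl = ^tcp" suggestion above can be illustrated with a small model of component selection. This is only a sketch under stated assumptions: it assumes a selection string is a comma-separated list in which a leading "^" turns the whole list into an exclusion list, and the function names (select_components, usable_network_transports) are hypothetical, not Open MPI code. The point it tries to show is that with tcp excluded, a failed openib leaves no network transport at all, so the job fails loudly instead of silently falling back.

```python
def select_components(available, spec):
    """Apply a btl-style selection string to a list of component names.

    Assumed semantics (a simplification): a comma-separated list names the
    components to use; a leading '^' makes the whole list an exclusion list.
    """
    spec = spec.strip()
    if not spec:
        return list(available)
    exclude = spec.startswith("^")
    if exclude:
        spec = spec[1:]
    names = {n.strip() for n in spec.split(",") if n.strip()}
    if exclude:
        return [c for c in available if c not in names]
    return [c for c in available if c in names]


def usable_network_transports(selected, openib_ok):
    """Return the selected transports that can reach a peer on another node.

    Simplified model: tcp always works, openib works only if it
    initialised, and everything else (self, sm) is node-local.
    """
    return [c for c in selected
            if c == "tcp" or (c == "openib" and openib_ok)]
```

With the default selection and a broken IB stack, this model leaves only tcp (the silent fallback Brian is worried about); with "^tcp" it leaves nothing, which corresponds to the "unreachable" errors Jeff mentions.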
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On May 21, 2008, at 12:21 PM, Terry Dontje wrote:

> So are you proposing to set btl_base_warn_component_unused to 0 or something more BTL specific?

Probably something more btl-specific. The libibverbs-being-shipped-in-main-line-Linux-distros issue doesn't really affect other BTLs, so I don't think it's appropriate to make this a global-to-all-of-OMPI issue.

But the question of when exactly the openib BTL should call the function that displays that message is up in the air. Brian and I seem to disagree. :-)

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On Wed, 21 May 2008, Jeff Squyres wrote:

> On May 21, 2008, at 3:38 PM, Jeff Squyres wrote:
>
>>> It would be great if libibverbs could return two different error messages - one for "there's no IB card in this machine" and one for "there's an IB card here, but we can't initialize it". I think that would make this argument go away. Open MPI could probably mimic that behavior by parsing the PCI tables, but that sounds ... painful.
>>
>> Thinking about this a bit more -- I think it depends on what kind of errors you are worried about seeing. IBV does separate the discovery of devices (ibv_get_device_list) from trying to open a device (ibv_open_device). So hypothetically, we *can* distinguish between these kinds of errors already. Do you see devices that are so broken that they don't show up in the list returned from ibv_get_device_list?
>
> FWIW: the *only* case I'm talking about changing the default for is when ibv_get_device_list returns an empty list (meaning that according to the verbs stack, there are no devices in the host). I think that we should *always* warn for any kinds of errors that occur after that (e.g., we find a device but can't open it, we find one or more devices but no active ports, etc.).

Previously, there has not been such a distinction, so I really have no idea which one caused the openib BTL to throw its error (and never really cared, as it was always somebody else's problem at that point). I'm only concerned about the case where there's an IB card, the user expects the IB card to be used, and the IB card isn't used. If the changes don't silence a warning in that situation, I'm fine with whatever you do. But does ibv_get_device_list return an HCA when the port is down (because the SM failed and the machine rebooted since that time)? If not, we still have a (fairly common, unfortunately) error case that we need to report (in my opinion).

Brian
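The distinction being argued over here (empty device list versus "found a device but something went wrong afterwards") can be written down as a small decision table. The sketch below is an illustration only, not the openib BTL's code: the counts stand in for what device discovery, device open, and port queries would report, and warn_no_devices is a hypothetical stand-in for the still-unnamed openib MCA parameter, defaulting to off as the RFC proposes.

```python
from enum import Enum


class Probe(Enum):
    """Hypothetical probe outcomes; the names are illustrative only."""
    NO_DEVICES = "device list is empty"
    OPEN_FAILED = "a device was listed but none could be opened"
    NO_ACTIVE_PORTS = "devices opened, but no port is active"
    OK = "at least one active port"


def classify(num_listed, num_opened, num_active_ports):
    """Map probe counts onto the cases distinguished in the thread."""
    if num_listed == 0:
        return Probe.NO_DEVICES
    if num_opened == 0:
        return Probe.OPEN_FAILED
    if num_active_ports == 0:
        return Probe.NO_ACTIVE_PORTS
    return Probe.OK


def should_warn(outcome, warn_no_devices=False):
    """Decide whether a warning is printed for a given outcome.

    warn_no_devices is an assumed, illustrative switch for the proposed
    openib-specific parameter; only the empty-list case consults it.
    """
    if outcome is Probe.OK:
        return False
    if outcome is Probe.NO_DEVICES:
        return warn_no_devices
    # Hardware is present but unusable: always warn.
    return True
```

Under this model, Brian's scenario (HCA present, port down because the SM failed) is classified as NO_ACTIVE_PORTS and still warns, provided the device does appear in the list.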
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
Then we disagree on a core point. I believe that users should never have something silently unexpected happen (like falling back to TCP from a high speed interconnect because of a NIC reset / software issue). You clearly don't feel this way. I don't really work on the project, but do have lots of experience being yelled at by users when something unexpected happens. I guarantee you we'll see a report of poor IB / application performance because of the silent fallback to TCP. There's a reason that error message was put in.

I don't get a vote anymore, so do whatever you think is best.

Brian
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
One thing I should clarify -- the ibverbs error message from my previous mail is a red herring. libibverbs prints that message on systems where the kernel portions of the OFED stack are not installed (such as the quick-n-dirty test that I did before -- all I did was install libibverbs without the corresponding kernel stuff). I installed the whole OFED stack on a machine with no verbs-capable hardware and verified that the libibverbs message does *not* appear when the kernel bits are properly installed and running.

So we're only talking about the Open MPI warning message here. More below.

On May 21, 2008, at 12:17 PM, Brian W. Barrett wrote:

>> 2. An out-of-the-box "mpirun a.out" will print warning messages in perfectly valid/good configurations (no verbs-capable hardware, but just happen to have libibverbs installed). This is a Big Deal.
>
> Which is easily solved with a better error message, as Pasha suggested.

I guess this is where we disagree: I don't believe that the issue is solved by making a "better" message. Specifically: this is the first case where we're saying "if you run with a valid configuration, you're going to get a warning message and you have to do something extra to turn it off." That just seems darn weird to me, especially when other MPIs don't do the same thing. Come to think of it, I can't think of many other software packages that do that.

>> In short: I think it's no longer safe to assume that machines with libibverbs installed must also have verbs-capable hardware.
>
> But here's the real problem -- with our current selection logic, a user with libibverbs but no IB cards gets an error message saying "hey, we need you to set this flag to make this error go away" (or would, per Pasha's suggestion). A user with a busted IB stack on a node (which we still saw pretty often at LANL) starts using TCP and their application runs like a dog.
>
> I guess it's a matter of how often you see errors in the IB stack that cause nic initialization to fail. The machines I tend to use still exhibit this problem pretty often, but it's possible I just work on bad hardware more often than is usual in the wild.

I guess this is the central issue: what *is* the common case? Which set of users should be forced to do something different?

I'm claiming that now that the Linux distros are shipping libibverbs, the number of users who have the openib BTL installed but do not have verbs-capable hardware will be *much* larger than those with verbs-capable hardware. Hence, I think the pain point should be for the smaller group (those with verbs-capable hardware): set an MCA param if you want to see the warning message.

(we can debate the default value for the BTL-wide base param later -- let's first just debate the *concept* as specific to the openib BTL)

> It would be great if libibverbs could return two different error messages - one for "there's no IB card in this machine" and one for "there's an IB card here, but we can't initialize it". I think that would make this argument go away. Open MPI could probably mimic that behavior by parsing the PCI tables, but that sounds ... painful.

Yes, this capability in libibverbs would be good. Parsing the PCI tables doesn't sound like our role. I'll ask the libibverbs authors about it...

> I guess the root of my concern is that unexpected behavior with no explanation is (in my mind) the most dangerous case and the one we should address by default. And turning this error message off is going to cause unexpected behavior without explanation.

But more information is available, and subject to normal troubleshooting techniques. And if you're in an environment where you *do* want to use verbs-capable hardware, then setting the MCA param seems perfectly acceptable to me. IIRC, LANL sets a whole pile of MCA params in the top-level openmpi-mca-params.conf file that are specific to their environment (right?). If that's true, what's one more param?

Heck, the OMPI installed by OFED can set an MCA param in openmpi-mca-params.conf by default (which is what most verbs-capable-hardware users utilize). That would solve the issue for 98% of the IB/iWARP users out there. Those who compile from source would need to do it manually.

I agree that this is less than perfect. My main point is that I really don't like the idea that "mpirun a.out" will result in warning messages for perfectly valid configurations.

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
As far as I know, only the OpenIB kernel drivers are installed by default with the distribution. The user-level pieces (libibverbs and the other OpenIB packages) are not installed by default; the user has to go to the package manager and explicitly select libibverbs. So if a user decided to install libibverbs, he had reasons for it, and I think it will be OK to show this warning.

Pasha
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
So are you proposing to set btl_base_warn_component_unused to 0 or something more BTL specific?

--td

Jeff Squyres wrote:

> What: Change default in openib BTL to not complain if no OpenFabrics devices are found
> Why: Many linuxes are shipping libibverbs these days, but most users still don't have OpenFabrics hardware
> Where: btl_openib_component.c
> When: For v1.3
> Timeout: Teleconf, 27 May 2008
>
> Short version
> =============
>
> Many major linuxes are shipping libibverbs by default these days. OMPI will therefore build the openib BTL by default, but then complains at run time when there's no OpenFabrics hardware.
>
> We should change the default in v1.3 to not complain if no OpenFabrics devices are found (perhaps have an MCA param to enable the warning if desired).
>
> Longer version
> ==============
>
> I just got a request from the Debian Open MPI package maintainers to include the following in the default openmpi-mca-params.conf for the OMPI v1.2 package:
>
>     # Disable the use of InfiniBand
>     # btl = ^openib
>
> Having this in the openmpi-mca-params.conf gives Debian an easy documentation path for users to shut up these warnings when they build on machines with libibverbs present but no OpenFabrics hardware.
>
> I think that this is fine for the v1.2 series (and will file a CMR for it). But for v1.3, I think we should change the default.
>
> The vast majority of users will not have OpenFabrics devices, and we should therefore not complain if we can't find any at run-time. We can/should still complain if we find OpenFabrics devices but no active ports (i.e., don't change this behavior).
>
> But for optimizing the common case: I think we should (by default) not print a warning if no OpenFabrics devices are found. We can also [easily] have an MCA parameter that *will* display a warning if no OpenFabrics devices are found.
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On Wed, 21 May 2008, Jeff Squyres wrote:

> 2. An out-of-the-box "mpirun a.out" will print warning messages in perfectly valid/good configurations (no verbs-capable hardware, but just happen to have libibverbs installed). This is a Big Deal.

Which is easily solved with a better error message, as Pasha suggested.

> 3. Problems with HCA hardware and/or verbs stack are uncommon (nowadays). I'd be ok asking someone to enable a debug flag to get more information on configuration problems or hardware faults.
>
> Shouldn't we be optimizing for the common case?
>
> In short: I think it's no longer safe to assume that machines with libibverbs installed must also have verbs-capable hardware.

But here's the real problem -- with our current selection logic, a user with libibverbs but no IB cards gets an error message saying "hey, we need you to set this flag to make this error go away" (or would, per Pasha's suggestion). A user with a busted IB stack on a node (which we still saw pretty often at LANL) starts using TCP and their application runs like a dog.

I guess it's a matter of how often you see errors in the IB stack that cause nic initialization to fail. The machines I tend to use still exhibit this problem pretty often, but it's possible I just work on bad hardware more often than is usual in the wild.

It would be great if libibverbs could return two different error messages - one for "there's no IB card in this machine" and one for "there's an IB card here, but we can't initialize it". I think that would make this argument go away. Open MPI could probably mimic that behavior by parsing the PCI tables, but that sounds ... painful.

I guess the root of my concern is that unexpected behavior with no explanation is (in my mind) the most dangerous case and the one we should address by default. And turning this error message off is going to cause unexpected behavior without explanation.

Just my $0.02.

Brian
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On May 21, 2008, at 11:14 AM, Brian W. Barrett wrote:

> I think having a parameter to turn off the warning is a great idea. So great in fact, that it already exists in the trunk and v1.2 :)! Setting the default value for the btl_base_warn_component_unused flag from 0 to 1 will have the desired effect.

Ah, ok. I either didn't know about this flag or forgot about it. :-)

I just tested this myself and see that there are actually *two* error messages (on a machine where I installed libibverbs, but with no OpenFabrics hardware, with OMPI 1.2.6):

    % mpirun -np 1 hello
    libibverbs: Fatal: couldn't read uverbs ABI version.
    --
    [0,1,0]: OpenIB on host eddie.osl.iu.edu was unable to find any HCAs.
    Another transport will be used instead, although this may result in
    lower performance.
    --

So the MCA param takes care of the OMPI message; I'll contact the libibverbs authors about their message.

> I'm not sure I agree with setting the default to 0, however. The warning has proven extremely useful for diagnosing that IB (or less often GM or MX) isn't properly configured on a compute node due to some random error. It's trivially easy for any packaging group to have the line
>
>     btl_base_warn_component_unused = 0
>
> added to $prefix/etc/openmpi-mca-params.conf during the install phase of the package build (indeed, our simple build scripts at LANL used to do this on a regular basis due to our need to tweak the OOB to keep IPoIB happier at scale).
>
> I think keeping the Debian guys happy is a good thing. Giving them an easy way to turn off silly warnings is a good thing. Removing a known useful warning to help them doesn't seem like a good thing.

I guess that this is what I am torn about. Yes, it's a useful message -- in some cases. But now that libibverbs is shipping in Debian and other Linuxes, the number of machines out there with verbs-capable hardware is far, far smaller than the number of machines without verbs-capable hardware. Specifically:

1. The number of cases where seeing the message by default is *not* useful is now potentially [much] larger than the number of cases where the default message is useful.
2. An out-of-the-box "mpirun a.out" will print warning messages in perfectly valid/good configurations (no verbs-capable hardware, but just happen to have libibverbs installed). This is a Big Deal.
3. Problems with HCA hardware and/or verbs stack are uncommon (nowadays). I'd be ok asking someone to enable a debug flag to get more information on configuration problems or hardware faults.

Shouldn't we be optimizing for the common case?

In short: I think it's no longer safe to assume that machines with libibverbs installed must also have verbs-capable hardware.

-- Jeff Squyres Cisco Systems
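The packaging step Brian describes (adding one "name = value" line to $prefix/etc/openmpi-mca-params.conf during install) could look roughly like the sketch below. It is an assumption-laden illustration, not anything from the Open MPI or Debian build scripts: set_mca_param is a hypothetical helper, and it works on the file's text only, assuming one "name = value" assignment per line with "#" starting a comment. It rewrites an existing uncommented assignment or appends a new one, so running it twice does not duplicate the line.

```python
import re


def set_mca_param(conf_text, name, value):
    """Return conf_text with `name = value` present exactly once.

    An existing uncommented assignment of `name` is rewritten in place;
    otherwise the assignment is appended as a new last line.
    """
    line = f"{name} = {value}"
    # Match only uncommented assignments; [ \t]* avoids swallowing newlines.
    pattern = re.compile(r"^[ \t]*" + re.escape(name) + r"[ \t]*=.*$",
                         re.MULTILINE)
    if pattern.search(conf_text):
        return pattern.sub(lambda m: line, conf_text, count=1)
    if conf_text and not conf_text.endswith("\n"):
        conf_text += "\n"
    return conf_text + line + "\n"
```

A package build would read the params file, pass its contents through a helper like this with "btl_base_warn_component_unused" and 0, and write the result back; a site that wants the warning could do the same with 1.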
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
And there's a typo in my first paragraph. The flag currently defaults to 1 (print the warning). It should be switched to 0 to turn off the warning. Sorry for any confusion I might have caused -- I blame the lack of caffeine in the morning.

Brian

On Wed, 21 May 2008, Pavel Shamis (Pasha) wrote:

> I agree with Brian. We could add a detailed description of how to disable the warning to the warning message itself.
>
> Pasha
>
> Brian W. Barrett wrote:
>
>> I think having a parameter to turn off the warning is a great idea. So great in fact, that it already exists in the trunk and v1.2 :)! Setting the default value for the btl_base_warn_component_unused flag from 0 to 1 will have the desired effect.
>>
>> I'm not sure I agree with setting the default to 0, however. The warning has proven extremely useful for diagnosing that IB (or less often GM or MX) isn't properly configured on a compute node due to some random error. It's trivially easy for any packaging group to have the line
>>
>>     btl_base_warn_component_unused = 0
>>
>> added to $prefix/etc/openmpi-mca-params.conf during the install phase of the package build (indeed, our simple build scripts at LANL used to do this on a regular basis due to our need to tweak the OOB to keep IPoIB happier at scale).
>>
>> I think keeping the Debian guys happy is a good thing. Giving them an easy way to turn off silly warnings is a good thing. Removing a known useful warning to help them doesn't seem like a good thing.
>>
>> Brian