On Apr 17, 2016, at 3:24 PM, dpchoudh . <dpcho...@gmail.com> wrote:
>
> Hello all
>
> As I understand, the openib BTL supports NIC failover, but I am confused
> about the scope of this support. Let me elaborate:
>
> 1. Is the failover support part of MPI specification?

No. MPI doesn't make many statements about failures.

> 2. Is it an openMPI-specific addition to MPI implementation?

I believe you're referring to the bfo PML, which could effect failover if a given BTL module fails. bfo was originally a fork from ob1, but it has not kept up with ob1 for quite a long time -- it may be way out of date.

> 3. Is it a verb-API specification?

IIRC, it works simply by detecting failures from the verbs API return codes. If it detects that a verbs device has failed, it simply shuts that device down and lets bfo use another, non-failed verbs device.

> Since the openib BTL uses only verbs APIs under the hood, it should work on
> any NIC (e.g. iWARP or RoCE) that support verbs, isn't it?

Hypothetically.

> 4. Is it an Infiniband specification? Going through my old mail archive, it
> seems that openMPI 1.2 release supported this without relying on the IB
> automatic path migration functionality, so it seems to me that what openMPI
> provides is independent of IB APM (that plus the openib BTL runs on link
> types other than Infiniband)
>
> 4.1 If it is based on infiniband APM, is this available if I chose to run a
> MTL (e.g. PSM) instead of the openib BTL?

I don't know if PSM supports the concept of APM.

> 5. If my understanding is correct on point #4 above (i.e. the openMPI
> failover is independent of any link specific capability of Infiniband), then
> why is a similar functionality not provided for other transport type? The
> only non-verb transport that I currently have access to is TCP, and I don't
> think the TCP transport provides auto-failover.

I believe that when Rolf implemented bfo, there were some tradeoffs and/or uncertainties about how the presence of failover capability would impact performance. Hence, he forked ob1 and added his failover support in bfo. I think there were some other requirements needed at the BTL layer, too, and I think Rolf only implemented those in the openib BTL. Failover has not been a highly-requested feature.
Given that it was developed once and then effectively abandoned, it simply hasn't been kept up.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/