On Apr 17, 2016, at 3:24 PM, dpchoudh . <dpcho...@gmail.com> wrote:
> 
> Hello all
> 
> As I understand, the openib BTL supports NIC failover, but I am confused 
> about the scope of this support. Let me elaborate:
> 
> 1. Is the failover support part of MPI specification?

No.  MPI doesn't make many statements about failures.

> 2. Is it an openMPI-specific addition to MPI implementation?

I believe you're referring to the bfo PML, which could effect failover if a 
given BTL module fails.  bfo was originally a fork from ob1, but it has not 
kept up with ob1 for quite a long time -- it may be way out of date.

> 3. Is it a verb-API specification?

IIRC, it works simply by detecting failures from the verbs API return codes.  
If it detects that a verbs device has failed, it shuts that device down and 
lets bfo use another, non-failed verbs device.
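
A toy sketch of the detect-and-switch idea described above, not bfo's actual 
code. Every name here (Device, post_send, send_with_failover, the OK/ERR 
codes) is hypothetical and only stands in for "a verbs call returned an 
error, so stop using that device and try the next one":

```python
# Hypothetical sketch: none of these names come from Open MPI or libibverbs.
from dataclasses import dataclass, field

OK = 0
ERR = -1  # stand-in for a nonzero return code from a verbs-style call


@dataclass
class Device:
    name: str
    healthy: bool = True   # simulated hardware state
    active: bool = True    # whether the failover layer still uses this device
    sent: list = field(default_factory=list)

    def post_send(self, msg):
        """Simulate a call that reports failure through its return code."""
        if not self.healthy:
            return ERR
        self.sent.append(msg)
        return OK


def send_with_failover(devices, msg):
    """Try each active device in order; shut down any that returns an error."""
    for dev in devices:
        if not dev.active:
            continue
        if dev.post_send(msg) == OK:
            return dev.name
        dev.active = False  # failure detected: stop using this device
    raise RuntimeError("no working device left")


devices = [Device("dev0"), Device("dev1")]
first = send_with_failover(devices, "a")    # goes out on dev0
devices[0].healthy = False                   # simulate a NIC failure
second = send_with_failover(devices, "b")   # dev0 errors out, dev1 takes over
```

The real thing also has to deal with in-flight messages on the failed 
device, which this sketch ignores entirely.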

> Since the openib BTL uses only verbs APIs under the hood, it should work on 
> any NIC (e.g. iWARP or RoCE) that supports verbs, shouldn't it?

Hypothetically.

> 4. Is it an Infiniband specification? Going through my old mail archive, it 
> seems that openMPI 1.2 release supported this without relying on the IB 
> automatic path migration functionality, so it seems to me that what openMPI 
> provides is independent of IB APM (that plus the openib BTL runs on link 
> types other than Infiniband)
> 
> 4.1 If it is based on InfiniBand APM, is this available if I choose to run an 
> MTL (e.g. PSM) instead of the openib BTL?

I don't know if PSM supports the concept of APM.

> 5. If my understanding is correct on point #4 above (i.e. the openMPI 
> failover is independent of any link-specific capability of InfiniBand), then 
> why is similar functionality not provided for other transport types? The 
> only non-verbs transport that I currently have access to is TCP, and I don't 
> think the TCP transport provides auto-failover.

I believe that when Rolf implemented bfo, there were some tradeoffs and/or 
uncertainties about how the presence of failover capability would impact 
performance.  Hence, he forked ob1 and added his failover support stuff in bfo.

I think there were some other requirements needed at the BTL layer, too, and I 
think Rolf only implemented those in the openib BTL.

Failover has not been a highly-requested feature.  Given that it was developed 
once and then effectively abandoned, it simply hasn't been kept up.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
