The root cause is that the nodes are treated as “heterogeneous”: the 
difference in HCAs leads each node to a different PML selection. For 
scalability reasons, we don’t circulate the choice of PML, since that isn’t 
something mpirun can “discover” and communicate.

One option we could pursue is to provide a mechanism by which we add the HCAs 
to the topology “signature” sent back by the daemon. This would allow us to 
detect the difference, and then ensure that the PML selection is included in 
the circulated wireup data so the system can at least warn you of the problem 
instead of silently hanging.
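
As a rough illustration of that idea (a sketch only, not Open MPI code): the 
Python below treats each daemon's signature as a plain string that includes 
the HCA's PCI vendor:device ID, and compares PML choices only when the 
signatures differ. The function names, the "hca=" signature format, and the 
warning text are all hypothetical. The HCA IDs and the ob1/cm selections for 
font1 and font2 come from Orion's report; font3 is assumed to select cm 
because it has the same HCA as font2.

```python
from collections import defaultdict


def group_by_signature(signatures):
    """Map each distinct signature string to the sorted nodes reporting it."""
    groups = defaultdict(list)
    for node, sig in signatures.items():
        groups[sig].append(node)
    return {sig: sorted(nodes) for sig, nodes in groups.items()}


def check_pml(signatures, pml_by_node):
    """Return a list of warnings; empty when nothing looks wrong."""
    if len(group_by_signature(signatures)) <= 1:
        # Homogeneous nodes: no need to circulate or compare the PML choice.
        return []
    # Heterogeneous nodes: compare the PML each node selected.
    nodes_by_pml = defaultdict(list)
    for node, pml in pml_by_node.items():
        nodes_by_pml[pml].append(node)
    if len(nodes_by_pml) <= 1:
        return []
    detail = "; ".join(
        f"{pml}: {', '.join(sorted(nodes))}"
        for pml, nodes in sorted(nodes_by_pml.items())
    )
    return [f"PML mismatch across heterogeneous nodes ({detail})"]


# Hypothetical signatures built from the lspci IDs quoted in this thread.
signatures = {
    "font1": "hca=15b3:6274",
    "font2": "hca=1077:7220",
    "font3": "hca=1077:7220",
}
# font1 and font2 as reported in the 1.10.3 error; font3 is an assumption.
pml_by_node = {"font1": "ob1", "font2": "cm", "font3": "cm"}

for warning in check_pml(signatures, pml_by_node):
    print(warning)
```

Keeping the comparison behind the signature check means the homogeneous case 
would not pay for the extra wireup data.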


> On Feb 28, 2017, at 10:38 AM, Orion Poplawski <or...@cora.nwra.com> wrote:
> 
> On 02/27/2017 05:19 PM, Howard Pritchard wrote:
>> Hi Orion
>> 
>> Does the problem occur if you only use font2 and 3?  Do you have MXM
>> installed on the font1 node?
> 
> No, running across font2/3 is fine.  No idea what MXM is.
> 
>> The 2.x series is using PMIX and it could be that is impacting the PML sanity
>> check.
>> 
>> Howard
>> 
>> 
>> Orion Poplawski <or...@cora.nwra.com> wrote on
>> Mon, Feb 27, 2017 at 14:50:
>> 
>>    We have a couple nodes with different IB adapters in them:
>> 
>>    font1/var/log/lspci:03:00.0 InfiniBand [0c06]: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] [15b3:6274] (rev 20)
>>    font2/var/log/lspci:03:00.0 InfiniBand [0c06]: QLogic Corp. IBA7220 InfiniBand HCA [1077:7220] (rev 02)
>>    font3/var/log/lspci:03:00.0 InfiniBand [0c06]: QLogic Corp. IBA7220 InfiniBand HCA [1077:7220] (rev 02)
>> 
>>    With 1.10.3 we saw the following errors with mpirun:
>> 
>>    [font2.cora.nwra.com:13982] [[23220,1],10] selected pml cm, but peer
>>    [[23220,1],0] on font1 selected pml ob1
>> 
>>    which crashed MPI_Init.
>> 
>>    We worked around this by passing "--mca pml ob1".  I notice now with
>>    openmpi 2.0.2 without that option I no longer see errors, but the mpi
>>    program will hang shortly after startup.  Re-adding the option makes it
>>    work, so I'm assuming the underlying problem is still the same, but
>>    openmpi appears to have stopped alerting me to the issue.
>> 
>>    Thoughts?
>> 
>>    --
>>    Orion Poplawski
>>    Technical Manager                          720-772-5637
>>    NWRA, Boulder/CoRA Office             FAX: 303-415-9702
>>    3380 Mitchell Lane                       or...@nwra.com
>>    Boulder, CO 80301                   http://www.nwra.com
>>    _______________________________________________
>>    users mailing list
>>    users@lists.open-mpi.org
>>    https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>> 
> 
> 
> -- 
> Orion Poplawski
> Technical Manager                          720-772-5637
> NWRA, Boulder/CoRA Office             FAX: 303-415-9702
> 3380 Mitchell Lane                       or...@nwra.com
> Boulder, CO 80301                   http://www.nwra.com