Hi all,
We had a customer using 1.2.6 with MX. We were running his jobs, some
of which used the MX BTL and some used the MX MTL.
He added a few more nodes to the cluster and installed the same OMPI.
When we tried to run jobs that spanned the new nodes, the jobs failed.
I did not keep the error messages, but they appeared to be the
standard message about a component such as "self" not being found.
The problem, in fact, was that he had installed OMPI, but for some
reason neither the MX BTL nor the MX MTL was installed. Thus, the
failure. I do not believe the error message for the BTL runs ever
specifically mentioned a missing MX component, even though we were
setting "--mca btl self,sm,mx". (We did not specify MX when using the
MTL; we only used "--mca pml cm".)
When OMPI cannot run _and_ a component was specifically requested but
is not available, it would be helpful if the error message named the
missing component.
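To illustrate the kind of check being asked for, here is a minimal
sketch in Python. It is not Open MPI code: the function names, the
message wording, and the idea of passing in a set of available
component names are all assumptions made for illustration, and the
sketch ignores the "^" exclusion syntax.

```python
def missing_components(requested, available):
    """Return requested component names that are absent from `available`.

    `requested` is a comma-separated list such as "self,sm,mx".
    `available` is a set of component names found on this node
    (hypothetical input; how it is gathered is out of scope here).
    """
    wanted = [name.strip() for name in requested.split(",") if name.strip()]
    return [name for name in wanted if name not in available]


def format_error(framework, requested, available):
    """Build a message naming the missing components, or None if all exist."""
    missing = missing_components(requested, available)
    if not missing:
        return None
    return (
        f"{framework}: requested component(s) not available on this node: "
        + ", ".join(missing)
    )


if __name__ == "__main__":
    # The new nodes had "self" and "sm" but no MX BTL installed.
    print(format_error("btl", "self,sm,mx", {"self", "sm", "tcp"}))
```

With a message along these lines, the missing MX BTL on the new
nodes would have been obvious from the first failed run.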
Thanks,
Scott