On Jun 2, 2010, at 9:54 AM, Jeff Squyres wrote:

>> this is the output I get on a node with ethernet and infiniband hardware.
>> note the Error regarding mx.
>> 
>> $ ~/openmpi-1.4.2-bin/bin/mpirun ~/bwlat/mpi_helloworld
>> [bordeplage-9.bordeaux.grid5000.fr:32365] Error in mx_init (error No MX
>> device entry in /dev.)

This message comes from ompi_common_mx_initialize(). mx_init() fails there 
because there is no MX device, and the function prints the above with:

opal_output(0, "Error in mx_init (error %s)\n", mx_strerror(mx_return));
return OMPI_ERR_NOT_AVAILABLE;

>> [bordeplage-9.bordeaux.grid5000.fr:32365] mca_btl_mx_component_init:
>> mx_get_info(MX_NIC_COUNT) failed with status 4(MX not initialized.)
> 
> I'm guessing the MX BTL is designed to be noisy when it fails, on the 
> assumption that if MX is down, you probably want to know it.
> 
> George/Myricom -- can you confirm?

This is odd. The ompi_common_mx_initialize() above does not return OPAL_SUCCESS 
to mca_btl_mx_component_init(), so mca_btl_mx_component_init() should return 
NULL and never call mx_get_info(). The second message is also printed with an 
opal_output(0, ...).
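
To make the expected control flow concrete, here is a minimal sketch. It is not 
the actual Open MPI source: every sketch_* name is a hypothetical stand-in 
(sketch_common_mx_initialize() for ompi_common_mx_initialize(), 
sketch_query_nic_count() for the mx_get_info(MX_NIC_COUNT) call, 
sketch_component_init() for mca_btl_mx_component_init()), and the error codes 
are placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins; these are NOT the real Open MPI or MX symbols. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_NOT_AVAILABLE = -1 };

/* Counts how often the NIC query stub is reached. */
static int sketch_nic_queries = 0;

/* Stand-in for ompi_common_mx_initialize() on a host without an MX device:
 * it reports the failure and returns an error code. */
static int sketch_common_mx_initialize(void)
{
    fprintf(stderr, "Error in mx_init (error %s)\n",
            "No MX device entry in /dev.");
    return SKETCH_ERR_NOT_AVAILABLE;
}

/* Stand-in for the mx_get_info(MX_NIC_COUNT) query. */
static int sketch_query_nic_count(void)
{
    sketch_nic_queries++;
    return 0;
}

/* Stand-in for mca_btl_mx_component_init(): if the common initialization
 * fails, return NULL right away and never reach the NIC query. */
void **sketch_component_init(int *num_modules)
{
    static void *modules[1];

    assert(num_modules != NULL);
    *num_modules = 0;

    if (SKETCH_SUCCESS != sketch_common_mx_initialize()) {
        return NULL;                      /* MX unavailable: stop here */
    }

    *num_modules = sketch_query_nic_count();
    return modules;
}
```

With this shape, a host without MX would print only the first message; the 
mx_get_info() failure could not appear because the query is never reached.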

I will let George comment on the verbosity.

It looks like ompi_common_mx_initialize() is doing things that affect memory 
before calling mx_init() such as setting ompi_mpi_leave_pinned to 1 and setting 
mpool_resources.regcache_clean = mx__regcache_clean.

There is a chicken-and-egg scenario. The BTL needs to set a registration cache 
environment variable before calling mx_init(), but the altering of mpool 
resources should probably wait until after the fact in case MX is not available.
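
One possible ordering, as a rough sketch only. Everything here is a 
hypothetical stand-in: "SKETCH_MX_RCACHE" is a placeholder and not the real 
variable name, sketch_mx_init() stands in for mx_init(), and the two globals 
stand in for ompi_mpi_leave_pinned and mpool_resources.regcache_clean. The 
environment variable is set first, and the memory-related settings are only 
changed once initialization has succeeded.

```c
#define _POSIX_C_SOURCE 200112L   /* for setenv() */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are NOT the real Open MPI or MX symbols. */
static bool sketch_mx_present = false;        /* pretend no MX device exists */
static int  sketch_leave_pinned = 0;          /* ~ ompi_mpi_leave_pinned */
static bool sketch_regcache_hook_set = false; /* ~ mpool regcache_clean hook */

/* Stand-in for mx_init(): succeeds only when an MX device is "present". */
static int sketch_mx_init(void)
{
    return sketch_mx_present ? 0 : -1;
}

/* Stand-in for ompi_common_mx_initialize() with the deferred ordering. */
int deferred_common_mx_initialize(void)
{
    /* Step 1: the environment variable must be in place before init.
     * The variable name and value are placeholders. */
    if (0 != setenv("SKETCH_MX_RCACHE", "2", 1)) {
        return -1;
    }

    /* Step 2: try to initialize; on failure nothing else was touched. */
    if (0 != sketch_mx_init()) {
        return -1;
    }

    /* Step 3: only now alter the memory-related settings. */
    sketch_leave_pinned = 1;
    sketch_regcache_hook_set = true;
    return 0;
}
```

The environment variable is still left set after a failed init in this sketch; 
whether that matters to other components is a separate question.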

Does the same error happen if he tries on an MX host that does not have IB?

Scott
