Hi,

We are migrating to Open MPI 1.6, but since 1.6 dropped support for
the Myricom GM driver, we have to switch to the MX driver. We have the
Myricom MX2G 1.2.16 driver installed. However, when testing the new
build of Open MPI on a node without an actual Myrinet device, we get
the following segmentation fault.

<---->
[yqin@n0007.scs00 ~]$ mpirun -np 2  -np 2 osu_bw
[n0007.scs00:03075] Error in mx_open_endpoint (error No MX device
entry in /dev.)
[n0007.scs00:03074] Error in mx_open_endpoint (error No MX device
entry in /dev.)
--------------------------------------------------------------------------
[[32626,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: Myrinet/MX
  Host: n0007.scs00

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
[n0007:03074] *** Process received signal ***
[n0007:03074] Signal: Segmentation fault (11)
[n0007:03074] Signal code: Invalid permissions (2)
[n0007:03074] Failing at address: 0x2b9112128130
[n0007:03075] *** Process received signal ***
[n0007:03075] Signal: Segmentation fault (11)
[n0007:03075] Signal code: Invalid permissions (2)
[n0007:03075] Failing at address: 0x2b041c9f1130
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 3075 on node n0007.scs00
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[n0007.scs00:03073] 1 more process has sent help message
help-mpi-btl-base.txt / btl:no-nics
[n0007.scs00:03073] Set MCA parameter "orte_base_help_aggregate" to 0
to see all help / error messages
<---->

Excluding the MX BTL does not get us any further.

<---->
[yqin@n0007.scs00 ~]$ mpirun -np 2 -mca btl ^mx -np 2 osu_bw
[n0007.scs00:03453] Error in mx_open_endpoint (error No MX device
entry in /dev.)
[n0007.scs00:03454] Error in mx_open_endpoint (error No MX device
entry in /dev.)
[n0007:03453] *** Process received signal ***
[n0007:03453] Signal: Segmentation fault (11)
[n0007:03453] Signal code: Address not mapped (1)
[n0007:03453] Failing at address: 0x2b3c1fe73130
[n0007:03454] *** Process received signal ***
[n0007:03454] Signal: Segmentation fault (11)
[n0007:03454] Signal code: Address not mapped (1)
[n0007:03454] Failing at address: 0x2b2431bf0130
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 3454 on node n0007.scs00
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
<---->

If we restrict Open MPI to designated BTLs such as SM and SELF, the
binary runs but still hits a segmentation fault towards the end.

<---->
[yqin@n0007.scs00 ~]$ mpirun -np 2 -mca btl sm,self -np 2 osu_bw
[n0007.scs00:03460] Error in mx_open_endpoint (error No MX device
entry in /dev.)
[n0007.scs00:03461] Error in mx_open_endpoint (error No MX device
entry in /dev.)
# OSU MPI Bandwidth Test v3.3
# Size        Bandwidth (MB/s)
1                         2.54
2                         5.22
4                        10.92
8                        21.61
16                       43.89
32                       62.19
64                      121.95
128                     212.28
256                     337.52
512                     516.67
1024                    701.29
2048                    845.69
4096                    836.45
8192                    934.31
16384                  1035.53
32768                  1186.90
65536                  1390.41
131072                 1519.14
262144                 1562.96
524288                 1596.78
1048576                1611.48
2097152                1616.09
4194304                1620.47
[n0007:03461] *** Process received signal ***
[n0007:03460] *** Process received signal ***
[n0007:03460] Signal: Segmentation fault (11)
[n0007:03460] Signal code: Address not mapped (1)
[n0007:03460] Failing at address: 0x2acac044d130
[n0007:03461] Signal: Segmentation fault (11)
[n0007:03461] Signal code: Address not mapped (1)
[n0007:03461] Failing at address: 0x2b8bc4121130
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 3460 on node n0007.scs00
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
<---->


Can anybody shed some light here? It looks like Open MPI is trying to
open the MX device no matter which BTLs are selected. This is on a
fresh build of Open MPI 1.6 configured with "--with-mx --with-openib".
We didn't have such an issue with the old GM BTL.

Thanks,

Yong Qin
