Hello,

I recently tried running HPL (HPLinpack), compiled against OMPI, over the
Myrinet MX interconnect. A simple hello-world program runs fine, but XHPL
fails with an error when it calls MPI_Send:

# mpirun -np 4 -H l0-0,c0-2 --prefix $MPIHOME --mca btl mx,self
/opt/hpl/openmpi-hpl/bin/xhpl
[l0-0.local:04707] *** An error occurred in MPI_Send
[l0-0.local:04707] *** on communicator MPI_COMM_WORLD
[l0-0.local:04707] *** MPI_ERR_INTERN: internal error
[l0-0.local:04707] *** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 0 with PID 4706 on node "l0-0" exited on signal 15.
3 additional processes aborted (not shown)

# mpirun -np 4 -H l0-0,c0-2 --prefix $MPIHOME --mca btl mx,self ~/atumanov/hello
Hello from Alex' MPI test program
Process 1 on compute-0-2.local out of 4
Hello from Alex' MPI test program
Hello from Alex' MPI test program
Process 0 on l0-0.local out of 4
Process 3 on compute-0-2.local out of 4
Hello from Alex' MPI test program
Process 2 on l0-0.local out of 4
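
Since the hello program only prints rank and hostname and (as far as I
recall) never calls MPI_Send, my next step is a bare-bones ping-pong along
these lines, just to check whether point-to-point traffic over the mx BTL
works at all outside of HPL. This is only a sketch I intend to try, not the
program from the runs above; I would compile it with mpicc and launch it
with the same mpirun line as the hello program:

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    char buf[64];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* rank 0 pings every other rank and waits for the echo */
        for (i = 1; i < size; i++) {
            strcpy(buf, "ping");
            MPI_Send(buf, 64, MPI_CHAR, i, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 64, MPI_CHAR, i, 0, MPI_COMM_WORLD, &status);
            printf("rank 0 got \"%s\" back from rank %d\n", buf, i);
        }
    } else {
        /* every other rank receives the ping and sends a pong back */
        MPI_Recv(buf, 64, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
        strcpy(buf, "pong");
        MPI_Send(buf, 64, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}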

The output from mx_info is as follows:
-------------------------------------------------------------------------------------------------
MX Version: 1.2.0g
MX Build: 
r...@blackopt.sw.myri.com:/home/install/rocks/src/roll/myrinet_mx10g/BUILD/mx-1.2.0g
Wed Jan 17 18:51:12 PST 2007
1 Myrinet board installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===================================================================
Instance #0:  299.8 MHz LANai, PCI-E x8, 2 MB SRAM
       Status:         Running, P0: Link up
       MAC Address:    00:60:dd:47:7d:73
       Product code:   10G-PCIE-8A-C
       Part number:    09-03362
       Serial number:  314581
       Mapper:         00:60:dd:47:7d:73, version = 0x591b1c74, configured
       Mapped hosts:   2

                                                               ROUTE COUNT
INDEX    MAC ADDRESS     HOST NAME                                P0
-----    -----------     ---------                                ---
  0) 00:60:dd:47:7d:73 compute-0-2.local:0                     D 0,0
  1) 00:60:dd:47:7d:72 l0-0.local:0                        1,0
-------------------------------------------------------------------------------------------------

I have several questions. First, can I launch OMPI-over-MX jobs from the
headnode, to be executed on the two compute nodes, even though the headnode
itself has no MX hardware? Second, what does the letter 'D' in the mx_info
host table (on the compute-0-2 line) stand for? Third, does the MX support
in OMPI mean MX-2G only, or is MX-10G supported as well?
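
For what it is worth, the variations I was planning to try next look roughly
like this (untested on my side, and the second one assumes the MX MTL
component was built into my OMPI install):

mpirun -np 4 -H l0-0,c0-2 --prefix $MPIHOME --mca btl mx,tcp,self /opt/hpl/openmpi-hpl/bin/xhpl
mpirun -np 4 -H l0-0,c0-2 --prefix $MPIHOME --mca pml cm --mca mtl mx /opt/hpl/openmpi-hpl/bin/xhpl

The first simply adds tcp to the BTL list so that any pair of processes that
cannot reach each other over MX can fall back to tcp; the second switches
from the MX BTL to the MX MTL via the cm PML.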

If anybody has encountered a similar problem and managed to work around it,
please do let me know.

Many thanks for your time and for bringing the community together.

Sincerely,
Alex.
