On Mar 14, 2006, at 4:42 AM, Pierre Valiron wrote:

I am now attempting to tune openmpi-1.1a1r9260 on Solaris Opteron.

I guess I should have pointed this out more clearly earlier. Open MPI 1.1a1 is a nightly alpha build from our development trunk. It isn't guaranteed to be stable; about the only guarantee is that it passed "make distcheck" on the Linux box we use to build the tarballs.

The Solaris patches have been moved over to the v1.0 release branch, so if stability is a concern, you might want to switch back to a nightly tarball from the v1.0 branch. There should also be another beta of the 1.0.2 release in the near future.

Each quad-processor node has two ethernet interfaces, bge0 and bge1.
The bge0 interfaces are dedicated to parallel jobs, correspond to the
node names pxx, and sit on a dedicated gigabit switch.
The bge1 interfaces provide NFS sharing etc., correspond to the node
names nxx, and use another gigabit switch.

1) I allocated 4 quad-processor nodes.
As documented in the FAQ, mpirun -np 4 -hostfile $OAR_FILE_NODES runs 4
tasks on the first SMP node, and mpirun -np 4 -hostfile $OAR_FILE_NODES
--bynode distributes one task to each node.
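That is the expected behavior. For reference, a minimal sketch of the two placement modes, assuming a hostfile that simply lists the pxx names with four slots each (the hostfile name and binary are placeholders):

  # hypothetical hostfile "hosts": one line per quad-processor node
  p01 slots=4
  p02 slots=4
  p03 slots=4
  p04 slots=4

  # default (by slot): fills the first node, so ranks 0-3 land on p01
  mpirun -np 4 -hostfile hosts ./a.out

  # --bynode: round-robins across nodes, one rank on each of p01-p04
  mpirun -np 4 -hostfile hosts --bynode ./a.out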

2) According to the users list, mpirun --mca pml teg should revert to
the 2nd-generation TCP transport instead of the default ob1 (3rd
generation). Unfortunately I get the message
No available pml components were found!
Have you removed the 2nd-generation TCP transport? Do you consider the
new ob1 competitive now?

On the development trunk, we have removed the TEG PML and all the PTLs. The OB1 PML provides competitive (and most of the time better) performance than the TEG PML for most transports. The major issue is that when we added one-sided communication, we used the BTL transports directly. The BTL and PTL frameworks were not designed to live together, which caused issues with the TEG PML.
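If you want to force the PML explicitly, something like the following is what I'd expect to work (ob1 is the default on the trunk anyway; teg only still exists on the v1.0 branch):

  # development trunk / 1.1: ob1 is the default (and only) point-to-point PML
  mpirun -np 4 --mca pml ob1 ./a.out

  # v1.0 release branch only: the 2nd-generation TEG PML is still there
  mpirun -np 4 --mca pml teg ./a.out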

3) According to the users list, tuned collective primitives are
available. Apparently they are now compiled by default, but they don't
seem functional at all:

mpirun --mca coll tuned
Signal:11 info.si_errno:0(Error 0) si_code:1(SEGV_MAPERR)
Failing at addr:0
*** End of error message ***

Tuned collectives are available, but not as heavily tested as the basic collectives. Do you have a test case in particular that causes problems?
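If you want to narrow it down yourself in the meantime, one thing worth trying is running the same binary with the collective framework forced to each component in turn and comparing (./ring is just a placeholder for whatever test program you are running):

  # the run that segfaults for you: tuned collectives forced
  mpirun -np 4 --mca coll tuned ./ring

  # same run with the basic collectives, for comparison
  mpirun -np 4 --mca coll basic ./ring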

4) According to the FAQ and the users list, Open MPI attempts to
discover and use all interfaces. I attempted to force it to use bge0
only, with no success.

mpirun --mca btl_tcp_if_exclude bge1
[n33:04784] *** An error occurred in MPI_Barrier
[n33:04784] *** on communicator MPI_COMM_WORLD
[n33:04784] *** MPI_ERR_INTERN: internal error
[n33:04784] *** MPI_ERRORS_ARE_FATAL (goodbye)
1 process killed (possibly by Open MPI)

That definitely shouldn't happen. Can you reconfigure and recompile with the option --enable-debug, then run with the added option --mca btl_base_debug 2 and send us the output you see? That might help in diagnosing the problem.
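In other words, something along these lines (reusing your hostfile and interface settings; the binary name is a placeholder):

  ./configure --enable-debug ...       # plus whatever options you used before
  make all install

  mpirun -np 4 -hostfile $OAR_FILE_NODES \
      --mca btl_tcp_if_exclude bge1 --mca btl_base_debug 2 ./a.out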

The FAQ states that a new syntax should be available soon. I checked
whether it is already implemented in openmpi-1.1a1r9260:

mpirun --mca btl_tcp_if ^bge0,bge1
mpirun --mca btl_tcp_if ^bge1
Both run, with identical performance.

This syntax only works for specifying component names, not interface names. So you would still need to use the btl_tcp_if_include and btl_tcp_if_exclude options.
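To make the distinction concrete (using only the parameters already mentioned above, plus the btl framework itself):

  # interface selection: use the TCP BTL's _include / _exclude parameters
  mpirun -np 4 --mca btl_tcp_if_include bge0 ./a.out

  # the ^ (negation) syntax applies to component lists, e.g. every BTL except tcp
  mpirun -np 4 --mca btl ^tcp ./a.out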

Brian


--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/

