On Apr 16, 2007, at 6:48 PM, Adams, Brian M wrote:
> I am attempting to port Sandia's DAKOTA code from MVAPICH to the default
> OpenMPI/Intel environment on Sandia's thunderbird cluster. I can
> successfully build DAKOTA in the default tbird software environment, but
> I'm having runtime problems when DAKOTA attempts to make a system call.
> Typical output looks like:
>
> [0,1,1][btl_openib_component.c:897:mca_btl_openib_component_progress]
> from an64 to: an64 error polling HP CQ with status LOCAL LENGTH ERROR
> status number 1 for wr_id 5714048 opcode 0

Unfortunately, making calls to system() or fork() will fail when
using the OFED 1.1 stack (such as on thunderbird). The fun part is
that the failure is not immediate; calling fork() or system() will
cause odd/interesting errors later in your program (such as what you
described above).
The only way around this is to call fork()/system() before the call
to MPI_INIT or after the call to MPI_FINALIZE.
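
A minimal sketch of that ordering in C, for illustration only. The helper names run_command() and run_job() are hypothetical and not part of DAKOTA or Open MPI, and the MPI calls are guarded by an assumed USE_MPI macro so the file also compiles without an MPI installation. It only shows where system() can sit relative to MPI_Init and MPI_Finalize; it does not help if the external command has to run in the middle of the MPI phase.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/wait.h>
#ifdef USE_MPI          /* assumed build flag: define it when compiling with mpicc */
#include <mpi.h>
#endif

/* Hypothetical helper: run a shell command with system() and return its
 * exit code, or -1 if it could not be started or did not exit normally. */
int run_command(const char *cmd)
{
    assert(cmd != NULL);
    int status = system(cmd);
    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* Hypothetical driver: external commands run only outside the
 * MPI_Init ... MPI_Finalize window. Returns 0 if both commands succeed. */
int run_job(int *argc, char ***argv, const char *pre_cmd, const char *post_cmd)
{
    /* Before MPI_Init: system() is safe here. */
    int pre = run_command(pre_cmd);

#ifdef USE_MPI
    MPI_Init(argc, argv);
    /* ... MPI work goes here; no fork()/system() inside this window ... */
    MPI_Finalize();
#else
    (void)argc;
    (void)argv;
#endif

    /* After MPI_Finalize: system() is safe again. */
    int post = run_command(post_cmd);

    return (pre == 0 && post == 0) ? 0 : 1;
}
```

From main(), this would be called as run_job(&argc, &argv, "some pre-processing command", "some post-processing command"), where both command strings are placeholders.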
The OFED 1.2 stack has proper support for fork()/system(), but I
don't know what tbird's plans are for upgrading (I doubt it has been
discussed yet since OFED 1.2 is still going through its release
process -- it's not final yet).

> Note: Both programs run fine with MVAPICH on tbird,

This is probably luck; I wouldn't count on it happening reliably.

> and with OpenMPI or
> MPICH on my Linux x86_64 SMP workstation.

There are many environments where fork() and system() work fine
(e.g., when only using tcp and shared memory), but the OFED 1.1 stack
is unfortunately not one of them.
I wish I had a better answer for you, but I don't. Sorry!
--
Jeff Squyres
Cisco Systems