Hi Galen,

Yes, that worked! Thanks so much!
-sophia

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Galen Shipman
Sent: Monday, June 11, 2007 2:15 PM
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI issue with Iprobe

I think the problem is that we use MPI_STATUS_IGNORE in the C++ bindings
but don't check for it properly in mtl_mx_iprobe. Can you try applying
this diff to ompi and having the user try again? We will also push this
into the 1.2 branch.

- Galen

Index: ompi/mca/mtl/mx/mtl_mx_probe.c
===================================================================
--- ompi/mca/mtl/mx/mtl_mx_probe.c      (revision 14997)
+++ ompi/mca/mtl/mx/mtl_mx_probe.c      (working copy)
@@ -58,11 +58,12 @@
     }
 
     if (result) {
-        status->MPI_ERROR = OMPI_SUCCESS;
-        MX_GET_SRC(mx_status.match_info, status->MPI_SOURCE);
-        MX_GET_TAG(mx_status.match_info, status->MPI_TAG);
-        status->_count = mx_status.msg_length;
-
+        if(MPI_STATUS_IGNORE != status) {
+            status->MPI_ERROR = OMPI_SUCCESS;
+            MX_GET_SRC(mx_status.match_info, status->MPI_SOURCE);
+            MX_GET_TAG(mx_status.match_info, status->MPI_TAG);
+            status->_count = mx_status.msg_length;
+        }
         *flag = 1;
     } else {
         *flag = 0;
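
For context on where the unguarded dereference comes from: frame 4 of the
backtraces below, _ZNK3MPI4Comm6IprobeEii, demangles to
MPI::Comm::Iprobe(int, int) const, the status-less C++ overload. Having
no MPI::Status to fill in, the bindings forward the MPI_STATUS_IGNORE
sentinel to the C layer, and the unpatched mtl_mx_iprobe writes through
it. A rough sketch of that call path (illustrative only, not the actual
Open MPI source; the helper name is made up):

    #include <mpi.h>

    // Approximately what the status-less MPI::Comm::Iprobe(source, tag)
    // binding does: with no status object to populate, it passes the
    // MPI_STATUS_IGNORE sentinel down to MPI_Iprobe, which reaches the
    // MX MTL. Before the patch above, the MTL wrote to status->MPI_ERROR
    // and friends without checking for the sentinel.
    static bool iprobe_without_status(MPI_Comm comm, int source, int tag)
    {
        int flag = 0;
        MPI_Iprobe(source, tag, comm, &flag, MPI_STATUS_IGNORE);
        return flag != 0;
    }

Since Open MPI's MPI_STATUS_IGNORE is a null status pointer, a write to a
field at a small offset from it is consistent with the "Failing at
address: 0x8" reported in the traces.
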
On Jun 11, 2007, at 12:55 PM, Corwell, Sophia wrote:

> Hi,
>
> We are seeing the following issue with Iprobe on our clusters running
> openmpi-1.2.2. Here is the code and related information:
>
> =======
> Modules currently loaded:
>
> (sn31)/projects>module list
> Currently Loaded Modulefiles:
>   1) /opt/modules/oscar-modulefiles/default-manpath/1.0.1
>   2) compilers/intel-9.1-f040-c045
>   3) misc/env-openmpi-1.2
>   4) mpi/openmpi-1.2.2_mx_intel-9.1-f040-c045
>   5) libraries/intel-mkl
>
> =======
> Source code:
>
> (sn31)/projects/>more probeTest.cc
>
> #include <mpi.h>
> #include <cassert>
>
> int main(int argc, char* argv[])
> {
>     MPI::Init(argc, argv);
>
>     const int rank = MPI::COMM_WORLD.Get_rank();
>     const int size = MPI::COMM_WORLD.Get_size();
>     const int sendProc = (rank + size - 1) % size;
>     const int recvProc = (rank + 1) % size;
>     const int tag = 1;
>
>     // send an asynchronous message
>     const int sendVal = 1;
>     MPI::Request sendRequest =
>         MPI::COMM_WORLD.Isend(&sendVal, 1, MPI_INT, recvProc, tag);
>
>     // wait for message to arrive
>     while (!MPI::COMM_WORLD.Iprobe(sendProc, tag)) {} // This line causes problems
>
>     // Receive asynchronous message
>     int recvVal;
>     MPI::Request recvRequest =
>         MPI::COMM_WORLD.Irecv(&recvVal, 1, MPI_INT, sendProc, tag);
>     recvRequest.Wait();
>
>     MPI::Finalize();
> }
>
> =======
> Compiled with:
>
> (sn31)/projects>/apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/bin/mpicxx -I/apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/include -g -c -o probeTest.o probeTest.cc
>
> (sn31)/projects>/apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/bin/mpicxx -g -o probeTest -L/apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/lib probeTest.o -lmpi
> /projects/global/x86_64/compilers/intel/intel-9.1-cce-045/lib/libimf.so:
> warning: warning: feupdateenv is not implemented and will always fail
>
> =======
> Error at runtime:
>
> (sn31)/projects>mpiexec -n 1 ./probeTest
> [sn31:17616] *** Process received signal ***
> [sn31:17616] Signal: Segmentation fault (11)
> [sn31:17616] Signal code: Address not mapped (1)
> [sn31:17616] Failing at address: 0x8
> [sn31:17616] [ 0] /lib64/tls/libpthread.so.0 [0x2a9665a4f0]
> [sn31:17616] [ 1] /apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/lib/openmpi/mca_mtl_mx.so(ompi_mtl_mx_iprobe+0x81) [0x2a9980b305]
> [sn31:17616] [ 2] /apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/lib/openmpi/mca_pml_cm.so(mca_pml_cm_iprobe+0x1f) [0x2a995eb817]
> [sn31:17616] [ 3] /apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/lib/libmpi.so.0(MPI_Iprobe+0xef) [0x2a956d363f]
> [sn31:17616] [ 4] ./probeTest(_ZNK3MPI4Comm6IprobeEii+0x3a) [0x4046aa]
> [sn31:17616] [ 5] ./probeTest(main+0x147) [0x40480b]
> [sn31:17616] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x2a967803fb]
> [sn31:17616] [ 7] ./probeTest(_ZNSt8ios_base4InitD1Ev+0x3a) [0x4038ca]
> [sn31:17616] *** End of error message ***
> mpiexec noticed that job rank 0 with PID 17616 on node sn31 exited on
> signal 11 (Segmentation fault).
>
> (sn31)/projects/ceptre/sdpautz/NWCC/temp>mpiexec -n 2 ./probeTest
> [sn31:17621] *** Process received signal ***
> [sn31:17620] *** Process received signal ***
> [sn31:17620] Signal: Segmentation fault (11)
> [sn31:17620] Signal code: Address not mapped (1)
> [sn31:17620] Failing at address: 0x8
> [sn31:17620] [ 0] /lib64/tls/libpthread.so.0 [0x2a9665a4f0]
> [sn31:17620] [ 1] /apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/lib/openmpi/mca_mtl_mx.so(ompi_mtl_mx_iprobe+0x81) [0x2a9980b305]
> [sn31:17620] [ 2] /apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/lib/openmpi/mca_pml_cm.so(mca_pml_cm_iprobe+0x1f) [0x2a995eb817]
> [sn31:17620] [ 3] /apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/lib/libmpi.so.0(MPI_Iprobe+0xef) [0x2a956d363f]
> [sn31:17620] [ 4] ./probeTest(_ZNK3MPI4Comm6IprobeEii+0x3a) [0x4046aa]
> [sn31:17620] [ 5] ./probeTest(main+0x147) [0x40480b]
> [sn31:17620] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x2a967803fb]
> [sn31:17620] [ 7] ./probeTest(_ZNSt8ios_base4InitD1Ev+0x3a) [0x4038ca]
> [sn31:17620] *** End of error message ***
> [sn31:17621] Signal: Segmentation fault (11)
> [sn31:17621] Signal code: Address not mapped (1)
> [sn31:17621] Failing at address: 0x8
> [sn31:17621] [ 0] /lib64/tls/libpthread.so.0 [0x2a9665a4f0]
> [sn31:17621] [ 1] /apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/lib/openmpi/mca_mtl_mx.so(ompi_mtl_mx_iprobe+0x81) [0x2a9980b305]
> [sn31:17621] [ 2] /apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/lib/openmpi/mca_pml_cm.so(mca_pml_cm_iprobe+0x1f) [0x2a995eb817]
> [sn31:17621] [ 3] /apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/lib/libmpi.so.0(MPI_Iprobe+0xef) [0x2a956d363f]
> [sn31:17621] [ 4] ./probeTest(_ZNK3MPI4Comm6IprobeEii+0x3a) [0x4046aa]
> [sn31:17621] [ 5] ./probeTest(main+0x1ad) [0x404871]
> [sn31:17621] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x2a967803fb]
> [sn31:17621] [ 7] ./probeTest(_ZNSt8ios_base4InitD1Ev+0x3a) [0x4038ca]
> [sn31:17621] *** End of error message ***
> mpiexec noticed that job rank 0 with PID 17620 on node sn31 exited on
> signal 11 (Segmentation fault).
> 1 additional process aborted (not shown)
>
> =======
> Additional Information:
>
> It appears that the call to Iprobe causes the problem; if that line is
> taken out, the code completes normally. Failures also occur with the
> gcc compilers.
>
> MPICH appears to work, at least for the Intel compiler.
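
A possible user-side workaround until a patched library is installed,
untested here: call the Iprobe overload that takes an explicit
MPI::Status, so the bindings never pass MPI_STATUS_IGNORE down to the MX
MTL. In probeTest.cc the probe loop would become:

    // Hypothetical workaround: supply a real status object so the MX
    // MTL has somewhere to write and the MPI_STATUS_IGNORE path is
    // never taken.
    MPI::Status probeStatus;
    while (!MPI::COMM_WORLD.Iprobe(sendProc, tag, probeStatus)) {}
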
> =======
> Hardware information:
>
> [root@spirit1 ~]# mx_info -q
> MX Version: 1.2.1-rc20
> MX Build: r...@tocc1.sandia.gov:/projects/global/src/myricom/mx-1.2.1-rc20
> Thu Jun 7 17:08:02 MDT 2007
> 1 Myrinet board installed.
> The MX driver is configured to support a maximum of:
>         8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
> ===================================================================
> Instance #0: 333.2 MHz LANai, 133.3 MHz PCI bus, 4 MB SRAM
>         Status: Running, P0: Link up, P1: Link up
>         Network: Myrinet 2000
>
>         MAC Address: 00:60:dd:48:ba:ae
>         Product code: M3F2-PCIXE-4
>         Part number: 09-02878
>         Serial number: 219851
>         Mapper (P0): 00:60:dd:48:c0:08, version = 0x01920f75, configured
>         Mapped hosts: 506
>         Mapper (P1): 00:60:dd:48:c0:08, version = 0x01920f75, configured
>         Mapped hosts: 506
>
> cat /apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/BUILD_ENV
> # Build Environment:
> USE="doc icc modules mx torque"
> COMPILER="intel-9.1-f040-c045"
> CC="icc"
> CXX="icpc"
> CLINKER="icc"
> FC="ifort"
> F77="ifort"
> CFLAGS=" -O3 -pipe"
> CXXFLAGS=" -O3 -pipe"
> FFLAGS=" -O3"
> MODULE_DEST="/apps/modules/modulefiles/mpi"
> MODULE_FILE="openmpi-1.2.2_mx_intel-9.1-f040-c045"
> INSTALL_DEST="/apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx"
> CONF_FLAGS=" --with-mx=/opt/mx --with-tm=/apps/torque"
>
> =======
>
> Thanks in advance for any help/advice you can provide.
>
> -Sophia

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users