Hi all,

We have a code built with Open MPI (v1.4.3) and the Intel v12.0 compiler that has been tested successfully on tens to hundreds of cores on our cluster. We recently ran the same code on 1020 cores and received the following runtime error:
> [d6cneh042:28543] *** Process received signal ***
> [d6cneh061:29839] Signal: Segmentation fault (11)
> [d6cneh061:29839] Signal code: Address not mapped (1)
> [d6cneh061:29839] Failing at address: 0x10
> [d6cneh030:26800] Signal: Segmentation fault (11)
> [d6cneh030:26800] Signal code: Address not mapped (1)
> [d6cneh030:26800] Failing at address: 0x21
> [d6cneh042:28543] Signal: Segmentation fault (11)
> [d6cneh042:28543] Signal code: Address not mapped (1)
> [d6cneh042:28543] Failing at address: 0x10
> [d6cneh021:27646] [ 0] /lib64/libpthread.so.0 [0x39aee0eb10]
> [d6cneh021:27646] [ 1] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi.so.0 [0x2af8b1c8bca8]
> [d6cneh021:27646] [ 2] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi.so.0 [0x2af8b1c8a1ef]
> [d6cneh021:27646] [ 3] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi.so.0 [0x2af8b1c16246]
> [d6cneh021:27646] [ 4] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libopen-pal.so.0(opal_progress+0x86) [0x2af8b22a6a26]
> [d6cneh021:27646] [ 5] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi.so.0 [0x2af8b1c879e7]
> [d6cneh021:27646] [ 6] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi.so.0 [0x2af8b1c1f701]
> [d6cneh021:27646] [ 7] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi.so.0 [0x2af8b1c1aec9]
> [d6cneh021:27646] [ 8] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi.so.0(MPI_Allreduce+0x73) [0x2af8b1be6203]
> [d6cneh021:27646] [ 9] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi_f77.so.0(MPI_ALLREDUCE+0xc5) [0x2af8b1977715]
> [d6cneh021:27646] [10] openmd_MPI [0x5e0b94]
> [d6cneh021:27646] [11] openmd_MPI [0x599877]
> [d6cneh021:27646] [12] openmd_MPI [0x5746e8]
> [d6cneh021:27646] [13] openmd_MPI [0x4f18b8]

Can anyone give some insight into the issue? I should note (as it may be relevant) that this job was run across a heterogeneous cluster of Intel Nehalem servers with a mixture of InfiniBand and Ethernet connections. Open MPI itself was built without any IB libraries (so I am assuming everything defaults to the TCP transport?).

Thanks in advance for any insight that may help us identify the issue.

Regards,

Tim.

Tim Stitt PhD (User Support Manager)
Center for Research Computing | University of Notre Dame
P.O. Box 539, Notre Dame, IN 46556 | Phone: 574-631-5287 | Email: tst...@nd.edu
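P.S. In case a standalone reproducer is useful: below is a minimal MPI_Allreduce test (a sketch in C; the element count, datatype, and repetition count are arbitrary choices and are not taken from our application) that we could run at 1020 ranks over the same TCP setup, to see whether the collective alone triggers the crash or whether the problem is specific to our code.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    const int n = 1024;              /* arbitrary element count, not from our code */
    double *sendbuf, *recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = malloc(n * sizeof(double));
    recvbuf = malloc(n * sizeof(double));
    for (i = 0; i < n; i++)
        sendbuf[i] = (double) rank;

    /* Sum across all ranks; repeat to stress the collective at scale. */
    for (i = 0; i < 100; i++)
        MPI_Allreduce(sendbuf, recvbuf, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Allreduce OK on %d ranks, recvbuf[0] = %g\n", size, recvbuf[0]);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

We would launch it with the same mpirun command line and node set as the production run.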