Dear Open MPI Users and Developers,

I encountered a problem with Open MPI while porting an application that
previously ran successfully with LAM/MPI and MPICH.

The program produces a segmentation fault (see [1] for the stack trace) when
calling MPI_Send with the following arguments:

MPI_Send((void *)0, 1, datatype, rank, tag, comm_); 

The first argument may look wrong at first sight, but it is correct: the
argument "datatype" is an MPI_Datatype which describes the memory layout of
the object to be sent using displacements relative to address zero, i.e.
absolute addresses. The other arguments are as expected: one such object is
sent to rank "rank" with tag "tag" via the communicator "comm_". The
MPI_Datatype is constructed programmatically from the object's member
definitions using MPI_Type_struct. The only MPI types involved are
MPI_DOUBLE and MPI_UNSIGNED_INT.
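
For illustration, here is a minimal sketch of the kind of construction I am
describing (the struct, its members and the helper function are made up for
this mail, not the actual application code; MPI_UNSIGNED is the predefined
handle used here for the unsigned int members):

#include <mpi.h>

struct obj {
    double   values[3];
    unsigned count;
};

/* Build a datatype whose displacements are absolute addresses, so that the
 * object can later be sent with MPI_BOTTOM / (void *)0 as the buffer. */
static MPI_Datatype build_type(struct obj *o)
{
    int          blocklens[2] = { 3, 1 };
    MPI_Aint     displs[2];
    MPI_Datatype types[2]     = { MPI_DOUBLE, MPI_UNSIGNED };
    MPI_Datatype newtype;

    MPI_Address(o->values, &displs[0]);
    MPI_Address(&o->count, &displs[1]);

    MPI_Type_struct(2, blocklens, displs, types, &newtype);
    MPI_Type_commit(&newtype);
    return newtype;
}

/* ... and later: MPI_Send(MPI_BOTTOM, 1, datatype, rank, tag, comm_); */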

I can reproduce the problem with the stable 1.2 release as well as the
1.2.1a snapshot of Open MPI. My OS is Linux with kernel 2.6.18 (Debian
Etch), running on standard dual Xeon hardware with GigE.

I tried to reduce the amount of data sent by excluding some of the object's
members from the transmission. There does not seem to be a particular member
or type that causes the problem; rather, there seems to be a limit on the
number of members or the overall data size that determines whether the call
succeeds. The "datatype" structure describes the type and location of
approx. 2,000,000 numbers. The data itself is approx. 16 MB (2M * 8
bytes/number, assuming doubles), which I would not expect to pose a problem
for an MPI implementation.
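
In case it is useful, below is a hypothetical stand-alone reproducer (again
not the actual application code) that builds a similarly large
MPI_Type_struct with absolute displacements over N doubles and sends it from
rank 0 to rank 1, so the element count can be varied from the command line
to bracket the size at which the failure appears:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, i;
    int n = (argc > 1) ? atoi(argv[1]) : 2000000;   /* number of doubles */
    double       *data      = malloc(n * sizeof(double));
    int          *blocklens = malloc(n * sizeof(int));
    MPI_Aint     *displs    = malloc(n * sizeof(MPI_Aint));
    MPI_Datatype *types     = malloc(n * sizeof(MPI_Datatype));
    MPI_Datatype  dtype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one type-map entry per number; displacements are absolute addresses */
    for (i = 0; i < n; i++) {
        data[i]      = (double)i;
        blocklens[i] = 1;
        types[i]     = MPI_DOUBLE;
        MPI_Address(&data[i], &displs[i]);
    }

    MPI_Type_struct(n, blocklens, displs, types, &dtype);
    MPI_Type_commit(&dtype);

    if (rank == 0)
        MPI_Send(MPI_BOTTOM, 1, dtype, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1) {
        MPI_Recv(MPI_BOTTOM, 1, dtype, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("received %d doubles\n", n);
    }

    MPI_Type_free(&dtype);
    MPI_Finalize();
    return 0;
}

It can be run with e.g. "mpirun -np 2 ./repro 2000000", varying the count.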

Thank you for any hints, ideas, or suggestions as to where the problem could be.

Regards, 
Michael

[1]

[head:09133] *** Process received signal ***
[head:09133] Signal: Segmentation fault (11)
[head:09133] Signal code: Address not mapped (1)
[head:09133] Failing at address: 0xb0127475
[head:09133] [ 0] [0xb7f0f440]
[head:09133] [ 1] /usr/lib/libmpi.so.0(ompi_convertor_pack+0x90) [0xb668f9a0]
[head:09133] [ 2] /usr/lib/openmpi/mca_btl_tcp.so(mca_btl_tcp_prepare_src+0x210) [0xb56daef0]
[head:09133] [ 3] /usr/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_exclusive+0x1de) [0xb5726ede]
[head:09133] [ 4] /usr/lib/openmpi/mca_pml_ob1.so [0xb5728238]
[head:09133] [ 5] /usr/lib/openmpi/mca_btl_tcp.so [0xb56ddc65]
[head:09133] [ 6] /usr/lib/libopen-pal.so.0(opal_event_base_loop+0x462) [0xb65bcf12]
[head:09133] [ 7] /usr/lib/libopen-pal.so.0(opal_event_loop+0x29) [0xb65bcfd9]
[head:09133] [ 8] /usr/lib/libopen-pal.so.0(opal_progress+0xc0) [0xb65b7260]
[head:09133] [ 9] /usr/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x3e5) [0xb571f965]
[head:09133] [10] /usr/lib/libmpi.so.0(MPI_Send+0x12f) [0xb66abf0f]
[head:09133] [11] /opt/plato/release_1.0/bin/engine(_ZN2GP15MPIProcessGroup4sendERKNS_9MemoryMapEii+0xd9) [0x81cec03]
[head:09133] [12] /opt/plato/release_1.0/bin/engine(_ZN2GP15MPIProcessGroup4sendEN5boost10shared_ptrINS_6EntityEEEii+0x2d0) [0x81d0358]
[head:09133] [13] /opt/plato/release_1.0/bin/engine(_ZN2GP20ParallelDataAccessor4loadEN5boost10shared_ptrINS_6EntityEEE+0x23b) [0x853c939]
[head:09133] [14] /opt/plato/release_1.0/bin/engine(_ZN2GP12Transactions6createEPKN11xercesc_2_77DOMNodeE+0x57f) [0x8426553]
[head:09133] [15] /opt/plato/release_1.0/bin/engine(_ZN2GP7FactoryIN5boost10shared_ptrINS_7XmlBaseEEESsPFS4_PKN11xercesc_2_77DOMNodeEENS_19DefaultFactoryErrorEE12createObjectES8_+0x76) [0x81ca06a]
[head:09133] [16] /opt/plato/release_1.0/bin/engine(_ZN2GP16XmlFactoryParser7descentEPN11xercesc_2_77DOMNodeEb+0x5b2) [0x81cd700]
[head:09133] [17] /opt/plato/release_1.0/bin/engine(_ZN2GP9XmlParser8traverseEb+0x278) [0x81c1eca]
[head:09133] [18] /opt/plato/release_1.0/bin/engine(_ZN2GP16XmlFactoryParser8traverseEb+0x19) [0x81c9eeb]
[head:09133] [19] /opt/plato/release_1.0/bin/engine(main+0x1d23) [0x81617f7]
[head:09133] [20] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xc8) [0xb6348ea8]
[head:09133] [21] /opt/plato/release_1.0/bin/engine(__gxx_personality_v0+0x15d) [0x815a731]
[head:09133] *** End of error message ***
mpirun noticed that job rank 0 with PID 9133 on node head exited on signal 11 (Segmentation fault).
2 additional processes aborted (not shown)



