Re: [OMPI users] Open MPI - Signal: Segmentation fault (11) Problem

2007-04-16 Thread Michael Gauckler
Hi George, 

thank you for replying and for the hint about using MPI_BOTTOM. I changed this
part of the code but still receive the same segmentation fault.

Unfortunately I cannot post a full example, but here is the code that 
seems most relevant to the problem.

The mechanism is as follows: from the object that needs to be transmitted,
a list is created which describes its members with their type, offset
and stride (the MemoryMapDescr). MemoryMap::mapType is used to put
the members into this list, the so-called MemoryMap.

From this vector of MemoryMapDescr an MPI_Datatype is constructed, which
is then used to transmit the object.

Maybe you could have a look at the code fragments and see if you spot
something that does not work well with Open MPI.

Today's testing again showed that the size of the data structures
triggers the problem. The dependence could be probabilistic (more data
gives a higher chance that something goes wrong) or real, e.g. some
buffer is too small or the differences between addresses in memory are
too large, or something else I haven't thought of.

Thank you for your help.

Regards,
Michael


int createMPIDataType(const std::vector<MemoryMapDescr>& memorymap,
                      MPI_Datatype& newtype)
{
    int err = MPI_SUCCESS;
    int num = memorymap.size();

    MPI_Datatype* types = new MPI_Datatype[num];
    int* lengths = new int[num];
    MPI_Aint* addresses = new MPI_Aint[num];

    // copy the vector with information about the type into temp.
    // arrays to be handed to MPI_Type_struct
    for (int i = 0; i < num; i++)
    {
        types[i] = MPIDataType[memorymap[i].type];
        lengths[i] = memorymap[i].len;

        // create address map according to actual memory layout
        err = MPI_Address(memorymap[i].addr, &addresses[i]);

        if (err != MPI_SUCCESS)
        {
            std::ostringstream msg;
            msg << "invalid address at index " << i;
            msg << " for type " << DataTypeNames[memorymap[i].type];
            msg << " at address " << memorymap[i].addr;
            GP_THROW_ERR(CommunicationErr, eMPIAddressError, msg.str());
        }
    }

    // create MPI datatype with equivalent information about types and
    // offsets
    err = MPI_Type_struct(num, lengths, addresses, types, &newtype);

    if (err != MPI_SUCCESS)
    {
        GP_THROW_ERR(CommunicationErr, eMPIDatatypeError,
                     "invalid MPI datatype");
    }

    err = MPI_Type_commit(&newtype);

    // Invalid datatype argument. May be an uncommitted MPI_Datatype
    // (see MPI_Type_commit).
    if (err != MPI_SUCCESS)
    {
        GP_THROW_ERR(CommunicationErr, eMPIDatatypeError,
                     "invalid MPI datatype");
    }

    // delete temp. arrays
    delete [] types;
    delete [] lengths;
    delete [] addresses;

    return err;
}


// Memory map descriptor.
// TODO: Add support for strided vectors.

struct MemoryMapDescr
{
MemoryMapDescr(DataType t, void* a, int l);

//! Data type.
DataType type;

//! Address of data in memory.
void* addr;

//! Number of data elements.
int len;

//! Stride.
// TODO: Add support for strided vectors.
int stride;

//! Type name string.
std::string typeName() const;
};


template<typename T>
void MemoryMap::mapType(const T& var)
{
    memoryMap_.push_back(MemoryMapDescr(DataTypeConverter<T>::type,
                                        (void*)&var, 1));
}

// With specializations such as the following, exemplified by a vector
// of doubles.
template<>
void MemoryMap::mapType< std::vector<double> >(const std::vector<double>& var)
{
    if (var.size() > 0)
        memoryMap_.push_back(MemoryMapDescr(DataTypeConverter<double>::type,
                                            (void*)&var[0], var.size()));
}




Re: [OMPI users] Open MPI - Signal: Segmentation fault (11) Problem

2007-04-11 Thread George Bosilca

Michael,

The MPI standard is quite clear: in order to have correct and
portable MPI code, you are not allowed to use (void*)0. Use
MPI_BOTTOM instead.

We have plenty of tests which exercise the exact behavior you describe
in your email, and they all pass. I will take a look at what happens,
but I need either the code or at least the part which creates the
datatype.


  Thanks,
george.

On Apr 11, 2007, at 3:54 AM, Michael Gauckler wrote:




"Half of what I say is meaningless; but I say it so that the other  
half may reach you"

  Kahlil Gibran




[OMPI users] Open MPI - Signal: Segmentation fault (11) Problem

2007-04-11 Thread Michael Gauckler
Dear Open MPI Users and Developers,

I encountered a problem with Open MPI when porting an application, which
successfully ran with LAM MPI and MPICH.

The program produces a segmentation fault (see [1] for the stack trace) when
doing the MPI_Send with the following arguments:

MPI_Send((void *)0, 1, datatype, rank, tag, comm_); 

The first argument seems wrong at first sight, but is correct because
the argument "datatype" is an MPI_Datatype
which describes the memory layout of the object to be sent and is
zero-based. The other arguments are as expected: one such object is sent to
rank "rank" with tag "tag" with the help of the communicator "comm_". The
MPI_Datatype is constructed programmatically from the object's member
definitions using MPI_Type_struct. The MPI types involved are solely
MPI_DOUBLE and MPI_UNSIGNED_INT.

I can reproduce the problem with the stable 1.2 release as well as the
1.2.1a snapshot of Open MPI.
My OS is Linux with Kernel 2.6.18 (Debian Etch) running on standard Dual
Xeon Hardware with GigE.

I tried to reduce the amount of data sent by excluding some of the object's
members from the transmission. There does not seem to be one particular
member or type which causes the problem; rather, there seems to be a limit
on the number of members or the data size which determines the success of
the call. The "datatype" structure describes the type and location of
approx. 2'000'000 numbers. The data itself is approx. 16MB (2M * 8
bytes/number, assuming doubles), which I would not expect to cause any
problem for an MPI implementation.

Thank you for hints, ideas or suggestions where the problem could be.

Regards, 
Michael

[1]

[head:09133] *** Process received signal ***
[head:09133] Signal: Segmentation fault (11)
[head:09133] Signal code: Address not mapped (1)
[head:09133] Failing at address: 0xb0127475
[head:09133] [ 0] [0xb7f0f440]
[head:09133] [ 1] /usr/lib/libmpi.so.0(ompi_convertor_pack+0x90)
[0xb668f9a0]
[head:09133] [ 2]
/usr/lib/openmpi/mca_btl_tcp.so(mca_btl_tcp_prepare_src+0x210) [0xb56daef0]
[head:09133] [ 3]
/usr/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_exclusive+
0x1de) [0xb5726ede]
[head:09133] [ 4] /usr/lib/openmpi/mca_pml_ob1.so [0xb5728238]
[head:09133] [ 5] /usr/lib/openmpi/mca_btl_tcp.so [0xb56ddc65]
[head:09133] [ 6] /usr/lib/libopen-pal.so.0(opal_event_base_loop+0x462)
[0xb65bcf12]
[head:09133] [ 7] /usr/lib/libopen-pal.so.0(opal_event_loop+0x29)
[0xb65bcfd9]
[head:09133] [ 8] /usr/lib/libopen-pal.so.0(opal_progress+0xc0) [0xb65b7260]
[head:09133] [ 9] /usr/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x3e5)
[0xb571f965]
[head:09133] [10] /usr/lib/libmpi.so.0(MPI_Send+0x12f) [0xb66abf0f]
[head:09133] [11]
/opt/plato/release_1.0/bin/engine(_ZN2GP15MPIProcessGroup4sendERKNS_9MemoryM
apEii+0xd9) [0x81cec03]
[head:09133] [12]
/opt/plato/release_1.0/bin/engine(_ZN2GP15MPIProcessGroup4sendEN5boost10shar
ed_ptrINS_6EntityEEEii+0x2d0) [0x81d0358]
[head:09133] [13]
/opt/plato/release_1.0/bin/engine(_ZN2GP20ParallelDataAccessor4loadEN5boost1
0shared_ptrINS_6EntityEEE+0x23b) [0x853c939]
[head:09133] [14]
/opt/plato/release_1.0/bin/engine(_ZN2GP12Transactions6createEPKN11xercesc_2
_77DOMNodeE+0x57f) [0x8426553]
[head:09133] [15]
/opt/plato/release_1.0/bin/engine(_ZN2GP7FactoryIN5boost10shared_ptrINS_7Xml
BaseEEESsPFS4_PKN11xercesc_2_77DOMNodeEENS_19DefaultFactoryErrorEE12createOb
jectES8_+0x76) [0x81ca06a]
[head:09133] [16]
/opt/plato/release_1.0/bin/engine(_ZN2GP16XmlFactoryParser7descentEPN11xerce
sc_2_77DOMNodeEb+0x5b2) [0x81cd700]
[head:09133] [17]
/opt/plato/release_1.0/bin/engine(_ZN2GP9XmlParser8traverseEb+0x278)
[0x81c1eca]
[head:09133] [18]
/opt/plato/release_1.0/bin/engine(_ZN2GP16XmlFactoryParser8traverseEb+0x19)
[0x81c9eeb]
[head:09133] [19] /opt/plato/release_1.0/bin/engine(main+0x1d23) [0x81617f7]
[head:09133] [20] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xc8)
[0xb6348ea8]
[head:09133] [21]
/opt/plato/release_1.0/bin/engine(__gxx_personality_v0+0x15d) [0x815a731]
[head:09133] *** End of error message ***
mpirun noticed that job rank 0 with PID 9133 on node head exited on signal
11 (Segmentation fault).
2 additional processes aborted (not shown)