On Dec 14, 2012, at 4:31 PM, Handerson, Steven wrote:

> I’m trying to track down an instance of openMPI writing to a freed block of 
> memory.
> This occurs with the most recent release (1.6.3) as well as 1.6, on a 64 bit 
> intel architecture, fedora 14.
> It occurs with a very simple reduction (allreduce minimum), over a single int 
> value.

Can you send a reproducer program?  The simpler, the better.

> I’m wondering if the openMPI developers use power tools such as valgrind / 
> dmalloc / etc
> on the releases to try to catch these things via exhaustive testing –
> but I understand memory problems in C are of the nature that anyone making a 
> mistake can propagate,
> so I haven’t ruled out problems in our own code.
> Also, I’m wondering if anyone has suggestions on how to track this down 
> further.

Yes, we do use such tools.

Can you cite the specific file/line where the problem is occurring?  The 
allreduce algorithms are fairly self-contained; it should be (relatively) 
straightforward to examine that code and see if there's a problem with the 
memory allocation there.

> I’m using allinea DDT and their builtin dmalloc, which catches the error, 
> which appears in
> the second memcpy in  opal_convertor_pack(), but I don’t have more details 
> than that at the moment.
> All I know so far is that one of those values has been freed.
> Obviously, I haven’t seen anything in earlier parts of the code which might 
> have triggered memory corruption,
> although both openMPI and intel IPP do things with uninitialized values 
> before this (according to Valgrind).

There are a number of issues that can lead to false positives for using 
uninitialized values.  Here are two of the most common cases:

1. When using TCP, one of our data headers has a padding hole in it, but we 
write the whole struct down a TCP socket file descriptor anyway.  Hence, it 
will generate a "read from uninit" warning.

2. When using OpenFabrics-based networks, tools like Valgrind don't see the 
OS-bypass initialization of the memory (which frequently comes directly from 
the hardware), so they generate a lot of false "read from uninit" positives.
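
To make case 1 concrete, here is a minimal sketch of how a padding hole 
arises.  The struct and function names below are hypothetical, made up for 
illustration; this is not Open MPI's actual header or code.  It assumes a 
typical ABI where a uint32_t is 4-byte aligned, so three bytes between the two 
fields belong to no field at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header (NOT Open MPI's real struct): a 1-byte tag followed
 * by a 4-byte length leaves a padding hole on typical ABIs. */
struct demo_hdr {
    uint8_t  tag;
    uint32_t length;
};

/* Number of bytes in the struct that belong to no field. */
static size_t demo_hdr_padding(void)
{
    return sizeof(struct demo_hdr) - (sizeof(uint8_t) + sizeof(uint32_t));
}

/* Field-by-field fill: the padding bytes keep whatever they held before.
 * Writing the whole struct to a socket afterwards sends those bytes too,
 * which is what a memory checker reports as a read of uninitialized data. */
static void demo_hdr_fill(struct demo_hdr *h, uint8_t tag, uint32_t length)
{
    assert(h != NULL);
    h->tag = tag;
    h->length = length;
}

/* Zero the whole struct first, so that in practice the padding bytes are
 * defined before the fields are set. */
static void demo_hdr_fill_zeroed(struct demo_hdr *h, uint8_t tag,
                                 uint32_t length)
{
    assert(h != NULL);
    memset(h, 0, sizeof(*h));
    h->tag = tag;
    h->length = length;
}
```

With demo_hdr_fill(), the data in both fields is valid, which is why such a 
report is harmless; the zeroing variant trades a small memset for a quieter 
memory checker.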

One thing you can try is to configure and build Open MPI with 
--with-valgrind.  This adds a little performance penalty, but we take extra 
steps to eliminate most false positives.  It could help separate the wheat 
from the chaff, in your case.

Jeff Squyres