Folks, I'm trying to track down an instance of Open MPI writing to a freed block of memory. This occurs with the most recent release (1.6.3) as well as 1.6, on a 64-bit Intel architecture running Fedora 14. It happens with a very simple reduction (allreduce minimum) over a single int value.
Has anyone had any recent problems like this? It may only show up as an intermittent error: there is no visible problem as long as the freed block hasn't been re-allocated, which depends on malloc. You may not know about it unless you've been debugging malloc with Valgrind, dmalloc or the like.

I'm wondering whether the Open MPI developers use power tools such as Valgrind or dmalloc on the releases to try to catch these things via exhaustive testing. That said, I understand that memory problems in C are such that a mistake anywhere can propagate, so I haven't ruled out problems in our own code.

I'm also wondering if anyone has suggestions on how to track this down further. I'm using Allinea DDT and its built-in dmalloc, which catches the error. It appears in the second memcpy in opal_convertor_pack(), but I don't have more details than that at the moment. All I know so far is that one of the pointers involved in that memcpy refers to memory that has already been freed. I haven't seen anything in earlier parts of the code which might have triggered memory corruption, although both Open MPI and Intel IPP do things with uninitialized values before this point (according to Valgrind).

Steve H.
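
P.S. For anyone who hasn't used a debugging malloc, here is a rough sketch of the general idea behind catching a write to freed memory. This is not dmalloc's or DDT's actual implementation, only an illustration. The names dbg_malloc, dbg_free, dbg_check and dbg_release_all, the fill byte and the quarantine size are all made up for this example. A "freed" block is filled with a known byte and held back from malloc, so a later stale write shows up as a damaged fill pattern instead of silently landing in whatever malloc hands out next.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define FREE_FILL 0xDF        /* hypothetical poison byte for freed blocks */
#define QUARANTINE_SLOTS 16   /* hypothetical limit on held-back blocks */

typedef struct {
    unsigned char *ptr;       /* block the program believes it has freed */
    size_t size;              /* size the caller passed to dbg_free */
} quarantined_block;

static quarantined_block quarantine[QUARANTINE_SLOTS];
static size_t quarantine_count = 0;

/* Allocation is just malloc; the checking happens on the free side. */
void *dbg_malloc(size_t size)
{
    return malloc(size);
}

/* Poison the block and park it instead of handing it back to malloc.
   Returns 0 if the block was quarantined, -1 if the quarantine was
   full and the block was really freed (and so is no longer checked). */
int dbg_free(void *p, size_t size)
{
    if (p == NULL)
        return 0;
    if (quarantine_count == QUARANTINE_SLOTS) {
        free(p);
        return -1;
    }
    memset(p, FREE_FILL, size);
    quarantine[quarantine_count].ptr = p;
    quarantine[quarantine_count].size = size;
    quarantine_count++;
    return 0;
}

/* Count quarantined blocks whose fill pattern has been overwritten,
   i.e. blocks that something wrote to after they were "freed". */
size_t dbg_check(void)
{
    size_t damaged = 0;
    for (size_t i = 0; i < quarantine_count; i++) {
        for (size_t j = 0; j < quarantine[i].size; j++) {
            if (quarantine[i].ptr[j] != FREE_FILL) {
                damaged++;
                break;
            }
        }
    }
    return damaged;
}

/* Really release every quarantined block and empty the quarantine. */
void dbg_release_all(void)
{
    for (size_t i = 0; i < quarantine_count; i++)
        free(quarantine[i].ptr);
    quarantine_count = 0;
}
```

In this sketch the program would call dbg_free(p, size) in place of free(p) and call dbg_check() at convenient points. A non-zero result means some code wrote through a stale pointer. Because the block is only held back and never actually returned to malloc, the stale write is detected reliably, whereas with plain free() it only matters once the block has been handed out again, which is why the symptom looks intermittent.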