Hi, has this deadlock been fixed in the 1.3 source yet?
Thanks,
Justin
Jeff Squyres wrote:
On Dec 11, 2008, at 5:30 PM, Justin wrote:
The more I look at this bug, the more I'm convinced it is in Open MPI
and not in our code. Here is why: our code generates a
communication/execution schedule. At each timestep this schedule is
executed and all communication and execution are performed. Our
problem is AMR, which means the communication schedule may change from
time to time. In this case the schedule has not changed in many
timesteps, meaning the same communication schedule has been used for
the last X (X being around 20 in this case) timesteps.
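For concreteness, here is a minimal sketch of that pattern (this is not
our actual code; the neighbor ranks, message counts, and buffer sizes
are placeholders): a fixed set of nonblocking sends/receives that gets
re-posted and completed on every timestep.

#include <mpi.h>
#include <stdlib.h>

#define NNEIGH 4        /* placeholder: neighbors in the schedule */
#define MSGLEN 65536    /* placeholder: message length in doubles */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf[NNEIGH], *recvbuf[NNEIGH];
    for (int i = 0; i < NNEIGH; i++) {
        sendbuf[i] = malloc(MSGLEN * sizeof(double));
        recvbuf[i] = malloc(MSGLEN * sizeof(double));
    }

    for (int step = 0; step < 100; step++) {
        /* the same (unchanged) communication schedule is executed
           every timestep */
        MPI_Request req[2 * NNEIGH];
        for (int i = 0; i < NNEIGH; i++) {
            int nbr = (rank + i + 1) % size;   /* placeholder neighbor */
            MPI_Irecv(recvbuf[i], MSGLEN, MPI_DOUBLE, nbr, 0,
                      MPI_COMM_WORLD, &req[2 * i]);
            MPI_Isend(sendbuf[i], MSGLEN, MPI_DOUBLE, nbr, 0,
                      MPI_COMM_WORLD, &req[2 * i + 1]);
        }
        MPI_Waitall(2 * NNEIGH, req, MPI_STATUSES_IGNORE);
        /* per-timestep computation would go here */
    }

    MPI_Finalize();
    return 0;
}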
Our code does involve a very large amount of communication. I have
been able to reproduce the hang on as few as 16 processors, and it
seems to me the hang occurs when we have lots of work per processor:
if I add more processors it may not hang, but reducing the number of
processors makes it more likely to hang.
What is the status on the fix for this particular freelist deadlock?
George is actively working on it because it is the "last" issue
blocking us from releasing v1.3. I fear that if he doesn't get it
fixed by tonight, we'll have to push v1.3 to next year (see
http://www.open-mpi.org/community/lists/devel/2008/12/5029.php and
http://www.open-mpi.org/community/lists/users/2008/12/7499.php).