Re: [OMPI users] Begginers question: why does this program hangs?

Jeff Squyres Tue, 18 Mar 2008 10:51:33 -0400

On Mar 18, 2008, at 10:32 AM, George Bosilca wrote:

Jeff hinted at the real problem in his email. Even if the program uses the correct MPI functions, it is not 100% correct.

I think we disagree here -- the sample program is correct according to the MPI spec. It's an implementation artifact that makes it deadlock.

The upcoming v1.3 series doesn't suffer from this issue; we revamped our transport system to distinguish between early and normal completions. The pml_ob1_use_eager_completion MCA param was added to v1.2.6 to allow correct MPI apps to avoid this optimization -- a proper fix is coming in the v1.3 series.

It might pass in some situations, but can lead to fake "deadlocks" in others. The problem comes from the flow control. If the messages are small (which is the case in the test example), Open MPI will send them eagerly. Without flow control, these messages will be buffered by the receiver, which will exhaust the memory on the receiver. Once this happens, some of the messages may get dropped, but the most visible result is that progress will happen very (VERY) slowly.
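To make the buffering argument above concrete, here is a toy Python model of a fast sender and a slow receiver. It is only a sketch: it does not use MPI and does not reflect how Open MPI's transports are implemented. The function name `simulate`, the tick-based timing, and the `credits` limit are all assumptions made for illustration. It models only the queue growth at the receiver, not message loss.

```python
from collections import deque


def simulate(num_messages, recv_every, credits=None):
    """Return the peak number of messages buffered at the receiver.

    num_messages: how many small messages the sender wants to send
    recv_every:   the receiver consumes one message every `recv_every` ticks
    credits:      hypothetical cap on buffered messages (None = no flow control)
    """
    queue = deque()   # messages sent eagerly but not yet consumed
    sent = 0
    peak = 0
    tick = 0
    while sent < num_messages or queue:
        tick += 1
        # Sender: one eager send per tick, unless the credit limit blocks it.
        if sent < num_messages and (credits is None or len(queue) < credits):
            queue.append(sent)
            sent += 1
        peak = max(peak, len(queue))
        # Receiver: slower than the sender, consumes one message now and then.
        if tick % recv_every == 0 and queue:
            queue.popleft()
    return peak


if __name__ == "__main__":
    print("no flow control:", simulate(1000, recv_every=4))
    print("with 8 credits: ", simulate(1000, recv_every=4, credits=8))
```

Without a limit, the peak queue length grows with the number of messages sent; with a credit limit, it stays bounded by the number of credits, at the cost of stalling the sender.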

Your text implies that we can actually *drop* (and retransmit) messages in the sm btl. That doesn't sound right to me -- is that what you meant?


--
Jeff Squyres
Cisco Systems
