Hi Eugene,
Eugene Loh wrote:
> At 2500 bytes, all messages will presumably be sent "eagerly" -- without
> waiting for the receiver to indicate that it's ready to receive that
> particular message. This would suggest congestion, if any, is on the
> receiver side. Some kind of congestion could, I suppose, still occur
> and back up on the sender side.
Can anyone chime in on what the message-size limit is for an "eager"
transmission?
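(For what it's worth, in Open MPI this appears to be a per-transport MCA parameter rather than anything mandated by the MPI standard, so the limit differs between, say, TCP and shared memory. I believe ompi_info can report it; the exact parameter names vary by BTL:)

```shell
# List the TCP transport's tunables and pick out its eager limit.
# Other transports (sm, openib, ...) have their own *_eager_limit params.
ompi_info --param btl tcp | grep eager_limit
```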
> On the other hand, I assume the memory imbalance we're talking about is
> rather severe. Much more than 2500 bytes to be noticeable, I would
> think. Is that really the situation you're imagining?
The memory imbalance is drastic. I'm expecting 2 GB of memory use per
process. The well-behaved processes (13/16) use the expected amount of
memory; the remaining misbehaving processes (3/16) use more than twice
as much. The specifics vary from run to run, of course. So, yes, there
are gigabytes of unexpected memory use to track down.
> There are tracing tools to look at this sort of thing. The only one I
> have much familiarity with is Sun Studio / Sun HPC ClusterTools. Free
> download, available on Solaris or Linux, SPARC or x64, plays with OMPI.
> You can see a timeline with message lines on it to give you an idea if
> messages are being received/completed long after they were sent.
> Another interesting view is constructing a plot vs time of how many
> messages are in-flight at any moment (including as a function of
> receiver). Lots of similar tools out there... VampirTrace (tracing side
> only, need to analyze the data), Jumpshot, etc. Again, though, there's
> a question in my mind if you're really backing up 1000s or more of
> messages. (I'm assuming the memory imbalances are at least Mbytes.)
I'll check out Sun HPC ClusterTools. Thanks for the tip.
Assuming the problem is congestion and that messages are backing up,
is there an accepted method of dealing with this situation? It seems
to me the general approach would be

    if (number of outstanding messages > high-water mark)
        wait until (number of outstanding messages < low-water mark)

where I suppose the "number of outstanding messages" is defined as the
number of messages that have been sent but not yet received by the
other side. Is there a way to get this number from MPI without having
to code it at the application level?
Thanks,
Shaun