Shaun Jackman wrote:
> Wow. Thanks, Eugene. I definitely have to look into the Sun HPC
> ClusterTools. It looks as though it could be very informative.
Great. And, I didn't mean to slight TotalView. I'm just not familiar
with it.
> What's the purpose of the 400 MB that MPI_Init has allocated?
It's for... um, I don't know. Let's see...
About a third of it appears to come from
    vt_open() -> VTThrd_open() -> VTGen_open
which I'm guessing is the VampirTrace instrumentation (presumably
allocating the buffers into which the MPI tracing data is collected).
It seems to go away if one doesn't collect message-tracing data.
Somehow, I can't see further into the library. Hmm. It does seem like
a lot, though. The shared-memory area (which MPI_Init allocates for
on-node message passing) is much smaller. The remaining roughly
130 MB per process seems to be independent of the number of processes.
An interesting exercise for the reader.
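If you want to take a crack at that exercise yourself, here is a rough
sketch of how I might start (my own quick hack, not anything from
ClusterTools). It's Linux-specific, since it just parses VmSize/VmRSS
out of /proc/self/status, and it only shows how much the whole process
grows across MPI_Init, not who inside the library asked for it:

/* mpiinit_mem.c -- rough check of how much memory MPI_Init adds.
 * Linux-only: parses VmSize/VmRSS from /proc/self/status.
 * Build:  mpicc mpiinit_mem.c -o mpiinit_mem
 * Run:    mpirun -np 2 ./mpiinit_mem
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the value (in kB) of one field of /proc/self/status, or -1. */
static long status_kb(const char *key)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    size_t n = strlen(key);
    long kb = -1;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, key, n) == 0) {
            kb = strtol(line + n, NULL, 10);
            break;
        }
    }
    fclose(f);
    return kb;
}

int main(int argc, char **argv)
{
    long vsz0 = status_kb("VmSize:");
    long rss0 = status_kb("VmRSS:");
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("rank %d: MPI_Init grew VmSize by %ld kB, VmRSS by %ld kB\n",
           rank, status_kb("VmSize:") - vsz0, status_kb("VmRSS:") - rss0);

    MPI_Finalize();
    return 0;
}

One caveat: RSS only counts pages that have actually been touched, so a
tool that tracks malloc calls can easily report a larger figure than the
RSS delta here.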
> The figure of in-flight messages vs. time when the receiver sleeps is
> particularly interesting. The sender appears to stop sending and block
> once there are 30,000 in-flight messages. Has Open MPI detected
> congestion and begun waiting for the receiver to catch up? Or is it
> something simpler, such as the underlying write(2) call to the TCP
> socket blocking? If it's the first case, perhaps I could tune this
> threshold to behave better for my application.
This particular case is for two on-node processes, so no TCP is
involved. There appear to be about 55K allocations, which looks like
the 85K peak minus the 30K at which the sender stalls. So maybe some
resource gets exhausted at that point. Dunno.
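If you want to poke at the stall point yourself, something along these
lines (my guess at the pattern behind your figure, not your actual code)
ought to reproduce it: rank 0 blasts small blocking sends while rank 1
sleeps before receiving. Small messages go out eagerly, so the sends
return immediately until some resource fills up and a send finally
blocks; the last count printed is roughly the in-flight ceiling:

/* flood.c -- sender floods small messages at a sleeping receiver.
 * Rough reproducer for the "sender stalls at N in-flight messages"
 * behavior; where it stalls depends on the transport's buffering,
 * not on anything in this code.
 * Build:  mpicc flood.c -o flood
 * Run:    mpirun -np 2 ./flood
 */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

#define NMSG 200000               /* more than the sender can buffer */

int main(int argc, char **argv)
{
    int rank, i, payload = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (i = 0; i < NMSG; i++) {
            MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            if (i % 1000 == 0) {
                printf("sent %d\n", i);   /* last line printed ~ stall point */
                fflush(stdout);
            }
        }
    } else if (rank == 1) {
        sleep(30);                        /* let the sender run way ahead */
        for (i = 0; i < NMSG; i++)
            MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}

If that count moves when you switch transports or fiddle with their
buffer settings, that would point at a resource limit rather than any
deliberate congestion-detection threshold.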
Anyhow, this may be starting to get into more detail than you (or I)
need to understand to address your problem. It *is* interesting stuff,
though.