Shaun Jackman wrote:

Wow. Thanks, Eugene. I definitely have to look into the Sun HPC ClusterTools. It looks as though it could be very informative.

Great. And, I didn't mean to slight TotalView. I'm just not familiar with it.

What's the purpose of the 400 MB that MPI_Init has allocated?

It's for... um, I don't know.  Let's see...

About a third of it appears to be
vt_open() -> VTThrd_open() -> VTGen_open()
which I'm guessing is due to the VampirTrace instrumentation (maybe allocating the buffers into which the MPI tracing data is collected). It seems to go away if one doesn't collect message-tracing data.

Somehow, I can't see further into the library. Hmm. It does seem like a bunch. The shared-memory area (which MPI_Init allocates for on-node message passing) is much smaller. The remaining roughly 130 MB per process seems to be independent of the number of processes.

An interesting exercise for the reader.
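
If you want a starting point for that exercise, here's a minimal sketch: it just samples VmSize and VmRSS from /proc/self/status before and after MPI_Init, so it's Linux-specific and crude, but it gives a quick read on how much of the footprint shows up at MPI_Init time (a real heap profiler would tell you who allocated it).

/* Minimal sketch: report process memory before and after MPI_Init.
 * Linux-specific (reads /proc/self/status).  VmSize is address space,
 * VmRSS is resident pages. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

static void print_mem(const char *label)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    if (!f) return;
    while (fgets(line, sizeof(line), f)) {
        if (!strncmp(line, "VmSize:", 7) || !strncmp(line, "VmRSS:", 6))
            printf("%s %s", label, line);
    }
    fclose(f);
}

int main(int argc, char **argv)
{
    print_mem("before MPI_Init:");
    MPI_Init(&argc, &argv);
    print_mem("after  MPI_Init:");
    MPI_Finalize();
    return 0;
}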

The figure of in-flight messages vs time when the receiver sleeps is particularly interesting. The sender appears to stop sending and block once there are 30,000 in-flight messages. Has Open MPI detected the congestion and begun waiting for the receiver to catch up? Or is it something simpler, such as the underlying write(2) call to the TCP socket blocking? If it's the first case, perhaps I could tune this threshold to behave better for my application.

This particular case is for two on-node processes. So, no TCP is involved. There appear to be about 55K allocations, which looks like the 85K peak minus the 30K at which the sender stalls. So, maybe some resource got exhausted at that point. Dunno.
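
For what it's worth, here's a minimal sketch of that kind of experiment, assuming small eager-size messages (the 64-byte size, 100000-message count, and 30-second nap are arbitrary choices, not what was used for the figure): run it with two on-node processes and watch where rank 0's counter stops climbing.

/* Minimal sketch: rank 1 naps before receiving while rank 0 keeps posting
 * small sends and reports its progress.  The point where rank 0's counter
 * stops climbing is (roughly) where the sender blocks. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

#define NMSGS  100000
#define MSGLEN 64

int main(int argc, char **argv)
{
    char buf[MSGLEN] = {0};
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (i = 0; i < NMSGS; i++) {
            MPI_Send(buf, MSGLEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            if (i % 1000 == 0) {
                printf("sent %d\n", i);
                fflush(stdout);
            }
        }
    } else if (rank == 1) {
        sleep(30);   /* receiver naps; messages pile up on the sender side */
        for (i = 0; i < NMSGS; i++)
            MPI_Recv(buf, MSGLEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}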

Anyhow, this may be starting to get into more detail than you (or I) need to understand to address your problem. It *is* interesting stuff, though.
