This note explores memory profiling, as discussed, e.g., in an e-mail thread. Here, I show an example of how to distinguish between heavy memory allocation by the user program and memory allocation within the MPI implementation caused by resource congestion.
The particular example I present uses Sun tools (Sun Studio compilers and tools, as well as Sun HPC ClusterTools, which is an Open MPI distribution). These tools are available for free download on Linux and Solaris systems and on x86 and SPARC processors. Other tools do similar things, though I am not familiar with them. For MPI tracing, there are tools like Intel Trace Analyzer, Vampir, Jumpshot, etc. For memory profiling, there are tools like Valgrind.
My sample program has rank 0 send a million short messages to rank 1, which receives them. The twist is that the receiver might sleep a few seconds before starting to receive. The program also allocates some memory in a function foo1().
I ran the program, asking Sun Studio to trace MPI messages and heap allocations. I could then look at the run using the Performance Analyzer.
Here is information on the top routines. In this case, I used the command-line analysis tool since I didn't want to have to generate more screenshots. We can clearly distinguish between memory allocation by the user program and memory allocation within the MPI implementation. Most of the functions show the same allocation activity, regardless of whether the receiver sleeps before catching up on incoming traffic. Allocation inside the message-passing calls, however, skyrockets when the receiver sleeps, with most of the allocation on the send side.
| routine | no sleep: bytes | no sleep: #allocs | sleep: bytes | sleep: #allocs |
|---|---|---|---|---|
| MPI_Init | 414444765 | 16251 | 414444765 | 16251 |
| foo1 | 134217728 | 2 | 134217728 | 2 |
| MPI_Send | 6743210 | 563 | 492279586 | 58725 |
| MPI_Recv | 5325230 | 138 | 71696450 | 2385 |
| MPI_Finalize | 11598 | 153 | 11598 | 153 |
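
To put rough numbers on "skyrocket", the growth factors can be computed directly from the table. The following is a small sketch in Python rather than output from the Sun tools. The figures are copied from the table above, while the names `PROFILE`, `growth`, and `send_share` are made up for this illustration.

```python
# Heap-allocation totals copied from the table above.
# routine -> ((bytes, allocs) with an active receiver,
#             (bytes, allocs) with a sleeping receiver)
PROFILE = {
    "MPI_Init":     ((414444765, 16251), (414444765, 16251)),
    "foo1":         ((134217728, 2),     (134217728, 2)),
    "MPI_Send":     ((6743210, 563),     (492279586, 58725)),
    "MPI_Recv":     ((5325230, 138),     (71696450, 2385)),
    "MPI_Finalize": ((11598, 153),       (11598, 153)),
}


def growth(profile):
    """Return {routine: (bytes_ratio, allocs_ratio)}, sleep over no sleep."""
    return {
        name: (sleep[0] / active[0], sleep[1] / active[1])
        for name, (active, sleep) in profile.items()
    }


def send_share(profile):
    """Fraction of the additional bytes (sleep minus no sleep) seen in MPI_Send."""
    extra = {name: sleep[0] - active[0]
             for name, (active, sleep) in profile.items()}
    return extra["MPI_Send"] / sum(extra.values())


if __name__ == "__main__":
    for name, (b, a) in growth(PROFILE).items():
        print(f"{name:13s} bytes x{b:8.2f}   allocs x{a:8.2f}")
    print(f"share of extra bytes in MPI_Send: {send_share(PROFILE):.1%}")
```

Routines whose ratios are exactly 1 (MPI_Init, foo1, MPI_Finalize) are unaffected by the sleeping receiver, so only MPI_Send and MPI_Recv stand out.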
Now, here are screenshots. They show message lines on a timeline display as well as a plot of the number of in-flight messages as a function of elapsed time. The "no sleep" (active) receiver case shows a steady backlog of about 1300 messages during the message-passing period. In contrast, the sleeping-receiver case shows a backlog of up to 85,000 messages, with relief occurring because the sender stalls periodically.
| | receiver is ready | receiver sleeps first |
|---|---|---|
| timeline | (screenshot) | (screenshot) |
| plot of # of in-flight messages as a function of elapsed time | (screenshot) | (screenshot) |
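
The shape of the in-flight plots can be mimicked with a toy model. The sketch below is not the MPI test program and does not model how Open MPI actually buffers or throttles messages. It is a discrete-time simulation in Python with made-up rates, where the function name `simulate_backlog` and all its parameters are assumptions chosen for illustration. The optional `stall_at` cap stands in for the sender stalling once too many messages are outstanding.

```python
def simulate_backlog(total, send_rate, recv_rate, sleep_ticks, stall_at=None):
    """Return the number of in-flight messages seen at each tick.

    total       -- number of messages the sender wants to send
    send_rate   -- messages the sender can inject per tick
    recv_rate   -- messages the receiver can drain per tick once awake
    sleep_ticks -- ticks the receiver sleeps before its first receive
    stall_at    -- if set, the sender stalls while the backlog is at this level
    """
    if send_rate <= 0 or recv_rate <= 0:
        raise ValueError("rates must be positive")
    if stall_at is not None and stall_at < 1:
        raise ValueError("stall_at must be at least 1")

    sent = 0
    in_flight = 0
    history = []
    tick = 0
    while sent < total or in_flight > 0:
        # Sender: inject messages unless it is finished or stalled by the cap.
        if sent < total:
            room = send_rate
            if stall_at is not None:
                room = max(0, min(send_rate, stall_at - in_flight))
            n = min(room, total - sent)
            sent += n
            in_flight += n
        # Sample the backlog after the send step, before the receiver drains.
        history.append(in_flight)
        # Receiver: drains only after it has woken up.
        if tick >= sleep_ticks:
            in_flight -= min(recv_rate, in_flight)
        tick += 1
    return history


if __name__ == "__main__":
    active = simulate_backlog(1000, 10, 10, sleep_ticks=0)
    sleepy = simulate_backlog(1000, 10, 10, sleep_ticks=50)
    capped = simulate_backlog(1000, 10, 10, sleep_ticks=50, stall_at=200)
    print("peak backlog, active receiver:  ", max(active))
    print("peak backlog, sleeping receiver:", max(sleepy))
    print("peak backlog, sender stalls:    ", max(capped))
```

With an active receiver the backlog stays small and flat. With a sleeping receiver it grows for as long as the receiver sleeps, unless the sender is forced to stall, in which case it plateaus at the cap.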