Scaling tests over the last few months have all shown a behavior that has elicited significant comment: namely, that the HNP is observed to grow to multiple gigabytes in size for runs involving several thousand processes. This represents a peak size that declines to a much smaller footprint once the application has been launched.
Given the degree of concern expressed over this behavior, I thought I would once again provide the explanation for it. I believe I have sent emails out about this before, but I know it can be difficult for people who don't work regularly with OpenRTE to remember those notes after time has passed.
The observed memory spike is caused by the way we handle the STG1 stage gate message sent to all processes. There are two contributing factors that specifically control this behavior on the current Open MPI releases and development trunk:
1. we send stage gate messages directly to each process. Thus, for N processes, there are N messages queued for transmission at the HNP; and
2. we use non-blocking RML/OOB send commands to do the communication. Note that we used to do blocking sends, but for speed purposes converted over to non-blocking sends late last year. This is a critical point for understanding the behavior, as you'll see in a moment.
The key to the memory spike lies in knowing that the RML/OOB actually *copies* the buffer given to it for transmission, then inserts the comm request into its queue for transmission as network access permits. When we used blocking sends, we only had *one* message in the queue at any time - hence, the memory footprint of the HNP remained small. However, now that we have converted to non-blocking sends, we have N messages in the queue. Thus, there are now *N* copies of the message buffer being made inside the HNP!
As transmission of each message is completed, the corresponding copy of the data is released. Hence, the HNP's footprint gradually shrinks as the communication is completed. Once the STG1 stage gate is passed, the footprint is back to a relatively small number.
One could question why the copy is being done at all. Well, when the original author of the RML/OOB wrote that code, he was concerned that callers might not retain the provided message buffer until *after* the communication had been completed.
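For anyone who wants the arithmetic spelled out, the copy-per-queued-message behavior described above can be sketched as a toy model. This is not OpenRTE code: the function name `peak_queue_bytes` and its parameters are hypothetical, and the only thing it models is "each send call copies the buffer, and the copy is freed when that send completes".

```python
from collections import deque


def peak_queue_bytes(num_procs: int, msg_bytes: int, blocking: bool) -> int:
    """Toy model (hypothetical, not the RML/OOB API): peak bytes held in
    copied message buffers at the sender while messaging num_procs peers."""
    queue = deque()  # each entry stands for one copied message buffer
    held = 0         # bytes currently held in copies
    peak = 0         # largest value of `held` seen so far

    for _ in range(num_procs):
        # The send call copies the caller's buffer and queues the copy.
        queue.append(msg_bytes)
        held += msg_bytes
        peak = max(peak, held)
        if blocking:
            # A blocking send completes, and frees its copy, before the
            # next send is issued, so at most one copy exists at a time.
            held -= queue.popleft()

    # Non-blocking sends: the queued copies are freed one by one as
    # each transmission completes.
    while queue:
        held -= queue.popleft()

    assert held == 0  # everything is released once all sends complete
    return peak
```

With made-up numbers, say 4096 processes and a 1 MiB message, `peak_queue_bytes(4096, 1 << 20, blocking=True)` gives 1 MiB, while `blocking=False` gives 4096 MiB, i.e. 4 GiB. The real message size and process count will of course differ, but the scaling with N is the point.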
The concern that callers might release the buffer too early was particularly acute for non-blocking sends, since the send call returns immediately but the message may not actually be sent until some unknowable time in the future. In addition, there are numerous places in the code where someone will create a single message buffer and then send it to multiple recipients using non-blocking sends. This buffer is then released once the send commands have been *issued* - but that doesn't mean that the messages have actually been sent!
We could have required that the buffer be retained until the communication is complete, but that would add complexity to the code in the caller's routine - and we opted to avoid that necessity.
Of course, we can revisit these design decisions in light of how we are currently using the system. Perhaps we *should* require the caller to maintain the buffer throughout the communication, and force the caller to deal with the associated code complexity.
Note that the obvious solution of just creating a new buffer for each send and then releasing it in the corresponding callback function would solve nothing - we would just be moving the copy function from the RML/OOB to the caller's function. I have seen this done in a few places in the code, but all that did was cause us to generate *two* copies of each message. So we would have to rely on the caller to be clever about buffer management to make any such change work.
Anyway, that is why the HNP is behaving as observed. Please note that this will automatically improve once we turn "on" the more advanced xcast modes, as the number of messages being queued at the HNP will dramatically decline. It won't change how the RML/OOB works, but it will reduce the footprint issue.
I hope that helps clarify and, perhaps, generate some useful thoughts on alternative approaches.
Ralph