Re: [OMPI devel] ORTE scalability issues

2007-04-17 Thread Ralph H Castain
Thanks Christian.

Actually, I was aware of that and should have clarified that these tests did
*not* involve the IPv6 code.

Ralph


On 4/17/07 1:31 AM, "Christian Kauhaus" wrote:

> Ralph H Castain:
>> even though the HNP isn't actually part of the MPI job itself, or the
>> processes are opening duplicate OOB sockets back to the HNP. I am not
>> certain which (if either) of these is the root cause, however - it needs
>> further investigation to identify the source of the extra sockets.
> 
> If you are using the IPv6-ready code: in this case we need to create two
> sockets for each OOB/TCP. One uses AF_INET and one uses AF_INET6.
> IIRC, we close the superfluous socket once the connection attempt on
> either one succeeds. Adrian, correct me if I'm wrong. :-)
> Unfortunately, there's no easy way around this.
> 
> Christian




Re: [OMPI devel] ORTE scalability issues

2007-04-17 Thread Christian Kauhaus
Ralph H Castain:
>even though the HNP isn't actually part of the MPI job itself, or the
>processes are opening duplicate OOB sockets back to the HNP. I am not
>certain which (if either) of these is the root cause, however - it needs
>further investigation to identify the source of the extra sockets.

If you are using the IPv6-ready code: in this case we need to create two
sockets for each OOB/TCP. One uses AF_INET and one uses AF_INET6.
IIRC, we close the superfluous socket once the connection attempt on
either one succeeds. Adrian, correct me if I'm wrong. :-)
Unfortunately, there's no easy way around this.
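
For anyone not familiar with that code path, the pattern is roughly the
following (a simplified sketch only, not the actual oob/tcp implementation -
the names and the error handling here are mine):

  /* Sketch of the dual-socket connect pattern described above.
   * NOT the actual oob/tcp code; error handling is illustrative only. */
  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <poll.h>

  static void set_nonblocking(int fd)
  {
      int flags = fcntl(fd, F_GETFL, 0);
      fcntl(fd, F_SETFL, flags | O_NONBLOCK);
  }

  /* Try IPv4 and IPv6 in parallel; keep whichever connects first. */
  int connect_dual(const struct sockaddr_in *sa4,
                   const struct sockaddr_in6 *sa6)
  {
      int fd4 = socket(AF_INET,  SOCK_STREAM, 0);
      int fd6 = socket(AF_INET6, SOCK_STREAM, 0);
      set_nonblocking(fd4);
      set_nonblocking(fd6);

      /* Non-blocking connects normally return -1/EINPROGRESS. */
      connect(fd4, (const struct sockaddr *)sa4, sizeof(*sa4));
      connect(fd6, (const struct sockaddr *)sa6, sizeof(*sa6));

      struct pollfd pfd[2] = {
          { .fd = fd4, .events = POLLOUT },
          { .fd = fd6, .events = POLLOUT },
      };
      if (poll(pfd, 2, -1) > 0) {     /* real code would also check SO_ERROR */
          if (pfd[0].revents & POLLOUT) {   /* IPv4 won the race */
              close(fd6);                   /* drop the superfluous socket */
              return fd4;
          }
          if (pfd[1].revents & POLLOUT) {   /* IPv6 won the race */
              close(fd4);
              return fd6;
          }
      }
      close(fd4);
      close(fd6);
      return -1;
  }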

Christian

-- 
Dipl.-Inf. Christian Kauhaus   <><
Lehrstuhl fuer Rechnerarchitektur und -kommunikation 
Institut fuer Informatik * Ernst-Abbe-Platz 1-2 * D-07743 Jena
Tel: +49 3641 9 46376  *  Fax: +49 3641 9 46372   *  Raum 3217


[OMPI devel] ORTE scalability issues

2007-04-16 Thread Ralph H Castain
Hello all

I understand that several people are interested in the OpenRTE scalability
issues - this is great! However, it appears we haven't done a very good job
of circulating information about the identified causes of those issues. In
the hope of helping people contribute productively, I thought it would be
useful to summarize the information and diagnoses to date, the remediation
plans developed by those of us working on the issues so far, and the status
of those efforts.

First, a quick recap so everyone starts from a common knowledge base. We
have performed roughly 4 scalability tests on OpenRTE/Open MPI over the last
two years. In each case, we had exclusive use of a large cluster so that we
could run large scale jobs - typically consisting of 500+ nodes and up to 8K
processes, operating under either SLURM or TM environments. We have also
received some scaling data from efforts at Sun involving Solaris-based
systems running under N1GE. The tests showed we could reliably launch to
about the 1K process level, but we encountered difficulties when extending
significantly beyond that point.

The scalability issues generally break down into two categories:

1. Memory consumption. We see a "spike" in memory usage by the HNP early in
the launch process that can be quite large. In the earliest tests, we saw
GBytes consumed during the launch of ~2K processes, with a steady-state
usage of ~20MBytes. The spike was caused by the copying of buffers during
transmission of OOB messages, combined with the large size of the STG1 xcast
message. This was corrected at the time (courtesy of Tim W) by having the
OOB *not* copy message buffers. Follow-on tests showed that the memory
"spike" had essentially disappeared.

However, recent tests indicate that this "fix" may have been lost, or we may
now be using a code path that bypasses it (we used to send the xcast
messages via blocking sends, but now use non-blocking sends, which do follow
a slightly different code path). Regardless of the reason, it appears that
the copying of buffers has returned, and OpenRTE once again exhibits the
GByte memory "spike" on large jobs.

Steady-state memory usage is driven by two things. First, we made a design
decision at the beginning of the project to provide maximum system
flexibility. Hence, there is no overarching control over the data being
stored within the system (specifically, within the GPR framework), and each
component/framework is free to store whatever its author wants. Given the
freelance nature of the development of these components, there is some
non-trivial duplication of information on the GPR. However, if you add all
that up, it doesn't amount to a very large number (on the order of a few
megabytes for large scales). On machines of interest to those of us working
on the code at the time, the steady-state memory footprint was not a
priority issue - hence, little has been done to reduce it.


2. OOB communications. There are two primary issues in this category, both
of which lead back to the same core problem. First, the number of TCP socket
connections back to the HNP grows linearly with the number of processes. In
the most recent tests, Galen reported ~20K sockets being opened on the HNP
for an 8K process job running on 4K nodes. If you look at the code, you will
find that (a) 8K of those sockets are due to each MPI process connecting
directly back to the HNP, and (b) 4K of those sockets are due to the orteds
on each node connecting back to the HNP. The other 8K sockets appear to be
due to a "bug" in the code: from what I can tell so far, it appears that
either the MPI layer's BTL/TCP component is opening a socket to the HNP,
even though the HNP isn't actually part of the MPI job itself, or the
processes are opening duplicate OOB sockets back to the HNP. I am not
certain which (if either) of these is the root cause, however - it needs
further investigation to identify the source of the extra sockets.
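
As a back-of-the-envelope check of those numbers (just the arithmetic from
Galen's report, rounded - not instrumentation from the actual run):

  #include <stdio.h>

  int main(void)
  {
      int nprocs = 8192;                /* MPI processes in the job  */
      int nnodes = 4096;                /* one orted per node        */

      int expected = nprocs + nnodes;   /* 12K: procs + orteds       */
      int observed = 20000;             /* ~20K seen on the HNP      */

      printf("expected HNP sockets: %d\n", expected);
      printf("unexplained sockets : %d\n", observed - expected);  /* ~8K */
      return 0;
  }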

The large number of sockets on the HNP causes timeout problems during the
initial "connection storm" as processes start up, which subsequently causes
the MPI job to go into "retry hell". To help relieve that problem, a
"listener thread" was introduced on the HNP (courtesy of Brian) that could
absorb the connections at a much higher rate. This has now been debugged in
the current OMPI trunk (and I believe was just moved to 1.2.1 for release).
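
For those who haven't looked at it, the idea is roughly the following (a
rough sketch only, not the code that actually went into the trunk; the queue
and the names here are mine): a dedicated thread does nothing but accept(),
so the connection storm is drained far faster than the main event loop could
manage, and the accepted descriptors are handed off for later processing.

  #include <pthread.h>
  #include <sys/socket.h>
  #include <unistd.h>

  #define MAX_PENDING 65536

  static int pending[MAX_PENDING];
  static int npending;
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  static void *listener_thread(void *arg)
  {
      int listen_fd = *(int *)arg;
      for (;;) {
          int fd = accept(listen_fd, NULL, NULL);
          if (fd < 0) {
              continue;                   /* real code would check errno */
          }
          pthread_mutex_lock(&lock);
          if (npending < MAX_PENDING) {
              pending[npending++] = fd;   /* event loop drains this later */
          } else {
              close(fd);                  /* queue full; shed the load */
          }
          pthread_mutex_unlock(&lock);
      }
      return NULL;
  }

The main event loop then only has to drain the pending list at its own pace
instead of racing the incoming connects.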

The second issue is the time it takes to transmit the various stage gate
messages from the HNP to each MPI process. The only stage gate of concern
here is STG1 since that is where substantial data is involved (we send the
info each process needs about its peers for interconnect purposes). The
current OMPI trunk uses a "direct" method - i.e., the HNP sends the stage
gate messages directly to each MPI process.
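
Conceptually, the direct method amounts to the following (a sketch only, not
the actual xcast code - peer_t and oob_send() here are stand-ins for the real
OOB interfaces): one point-to-point send per process, so the work on the HNP
grows linearly with the job size.

  #include <stddef.h>

  typedef struct { int rank; } peer_t;            /* placeholder peer handle */

  static void oob_send(peer_t *p, const void *buf, size_t len)
  {
      (void)p; (void)buf; (void)len;              /* stand-in for the real send */
  }

  static void xcast_direct(peer_t *procs, int nprocs,
                           const void *stg1_payload, size_t len)
  {
      for (int i = 0; i < nprocs; i++) {
          oob_send(&procs[i], stg1_payload, len); /* one send per MPI process */
      }
  }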

We have already implemented two measures to help reduce this part of the
problem. First, late last year we revised the G