On 05/02/2011 02:04 PM, Robert Walters wrote:
Terry,
I was under the impression that all connections are made because of
the nature of the program that OpenMPI is invoking. LS-DYNA is a
finite element solver and for any given simulation I run, the cores on
each node must constantly communicate with one another to check for
various occurrences (contact between various pieces/parts, updating
nodal coordinates, etc.).
You might be right; the connections might have been established. But the
error message you cite ("connection refused") seems out of place if the
connection was already established.
Were there any error messages from OMPI other than "connection
refused"? If so, could you provide that output to us? It might give us
a hint as to where in the library things are going wrong.
I've run the program using --mca mpi_preconnect_mpi 1 and the
simulation has started up successfully, which I think means the
preconnect passed, since all of the child processes have started on
each individual node. Thanks for the suggestion though, it's a good
place to start.
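For anyone following along, the preconnect run described above would look something like the command below. The process count, hostfile name, and executable name are placeholders, not taken from the original posts; only the MCA flag itself is from the thread.

```shell
# Force all point-to-point connections to be wired up during MPI_Init,
# instead of lazily on first send (np/hostfile/binary are placeholders):
mpirun --mca mpi_preconnect_mpi 1 -np 48 --hostfile hosts ./lsdyna_job
```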
Yeah, it could be telling if things do work with this setting.
I've been worried (though I have no basis for it) that messages may be
getting queued up and hitting some kind of ceiling or timeout. As a
finite element code, I think the communication occurs on a large
scale. Lots of very small packets going back and forth quickly. A few
studies have been done by the High Performance Computing Advisory
Council
(http://www.hpcadvisorycouncil.com/pdf/LS-DYNA%20_analysis.pdf) and
they've suggested that LS-DYNA communicates at very, very high rates
(I'm not sure, but page 15 of that document suggests hundreds of
millions of messages over only a few hours). Is there any kind of
buffer or queue that OpenMPI develops if messages are created too
quickly? Does it dispatch them immediately or does it attempt to apply
some kind of traffic flow control?
The queuing really depends on what type of calls the application is
making. If it is doing blocking sends, then I wouldn't expect much
queuing to happen with the tcp btl. As far as traffic flow control is
concerned, I believe the tcp btl does none for the most part and lets
TCP handle that. Maybe someone else on the list can chime in if I am
wrong here.
In the past I have seen cases where lots of traffic on the network, and
to a particular node, has caused some connections not to be
established. But I don't know of any outstanding issues like that right
now.
--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com