Marcin,
Based on your description, you might want to investigate non-blocking
collectives (e.g. MPI_Ibcast) or even the upcoming persistent collectives
(e.g. MPIX_Bcast_init).
If you know the address of the receive buffer, then you can MPI_Ibcast()
on non-root ranks very early, and then …
… (put/get semantics), but this will
require some study if you are not familiar with it.
From: devel On Behalf Of George Bosilca
Sent: Thursday, March 21, 2019 7:31 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] Memory performance with Bcast
Marcin,
I am not sure I understand your question: a bcast is a collective operation
that must be posted by all participants. Independently of the level at which
the bcast is serviced, if some of the participants have not posted their
participation to the collective, only partial progress can be made.
Marcin,
HPC-X implements the MPI BCAST operation by leveraging hardware multicast
capabilities. Starting with HPC-X v2.3 we introduced a new multicast-based
algorithm for large messages as well. Hardware multicast scales as O(1),
modulo switch hops. It is the most efficient way to broadcast a …
Thanks, George! So, the function you mentioned is used when I turn off
HCOLL and use Open MPI's tuned coll instead. That helps a lot. Another
thing that makes me wonder is that in my case the data is sent to the
targets asynchronously, or rather, it is a 'put' operation in nature,
and the …
If you have support for FCA, then it might happen that the collective will
use the hardware support. In any case, most of the bcast algorithms have
logarithmic behavior, so there will be at most O(log(P)) memory accesses on
the root.
If you want to take a look at the code in OMPI to understand …
Hi!
I'm wondering about the details of the Bcast implementation in Open MPI. I'm
specifically interested in IB interconnects, but information about other
architectures (and Open MPI in general) would also be very useful.
I am working with a code which sends the same (large) message to a bunch …