I will try to take a look at it today. -Nathan
> On May 9, 2019, at 12:37 AM, Joseph Schuchart via users <users@lists.open-mpi.org> wrote:
>
> Nathan,
>
> Over the last couple of weeks I made some more interesting observations regarding the latencies of accumulate operations on both Aries and InfiniBand systems:
>
> 1) There seems to be a significant difference between 64-bit and 32-bit operations: on Aries, the average latency for compare-exchange on 64-bit values is about 1.8us, while on 32-bit values it is 3.9us, a factor of more than 2x. On the IB cluster, fetch-and-op, compare-exchange, and accumulate all show a similar difference between 32 and 64 bit. There are no differences between 32-bit and 64-bit puts and gets on these systems.
>
> 2) On both systems, the latency for a single-value atomic load using MPI_Fetch_and_op + MPI_NO_OP is 2x that of MPI_Fetch_and_op + MPI_SUM on 64-bit values, roughly matching the latency of 32-bit compare-exchange operations.
>
> All measurements were done using Open MPI 3.1.2 with OMPI_MCA_osc_rdma_acc_single_intrinsic=true. Is that behavior expected as well?
>
> Thanks,
> Joseph
>
> On 11/6/18 6:13 PM, Nathan Hjelm via users wrote:
>> All of this is completely expected. Due to the requirements of the standard it is difficult to make use of network atomics even for MPI_Compare_and_swap (MPI_Accumulate and MPI_Get_accumulate spoil the party). If you want MPI_Fetch_and_op to be fast, set this MCA parameter:
>>
>> osc_rdma_acc_single_intrinsic=true
>>
>> A shared lock is slower than an exclusive lock because there is an extra lock step as part of the accumulate (it isn't needed if there is an exclusive lock). By setting the above parameter you are telling the implementation that you will only be using a single count, and we can optimize that with the hardware. The RMA working group is working on an info key that will essentially do the same thing.
>>
>> Note that the above parameter won't help you with IB if you are using UCX, unless you set this (master only right now):
>>
>> btl_uct_transports=dc_mlx5
>> btl=self,vader,uct
>> osc=^ucx
>>
>> Though there may be a way to get osc/ucx to enable the same sort of optimization. I don't know.
>>
>> -Nathan
>>
>> On Nov 06, 2018, at 09:38 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>> All,
>>>
>>> I am currently experimenting with MPI atomic operations and wanted to share some interesting results I am observing. The numbers below are measurements from both an IB-based cluster and our Cray XC40. The benchmarks look like the following snippet:
>>>
>>> ```
>>> if (rank == 1) {
>>>   uint64_t res, val;
>>>   for (size_t i = 0; i < NUM_REPS; ++i) {
>>>     MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
>>>     MPI_Win_flush(0, win);
>>>   }
>>> }
>>> MPI_Barrier(MPI_COMM_WORLD);
>>> ```
>>>
>>> Only rank 1 performs atomic operations; rank 0 waits in a barrier (I have tried to confirm that the operations are done in hardware by letting rank 0 sleep for a while and ensuring that communication progresses).
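For context, a complete, runnable version of the quoted loop might look like the sketch below. Only the inner loop comes from the original snippet (here with a 64-bit operand); the window allocation, the explicit lock type, and the timing code are assumptions added for illustration.

```
/* Minimal sketch of the fetch_op benchmark: rank 1 repeatedly performs
 * MPI_Fetch_and_op on a 64-bit counter exposed by rank 0 and reports the
 * average latency. Window setup, lock type, and timing are assumptions. */
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_REPS 100000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One 64-bit slot per rank; rank 0 is the target of all operations. */
    uint64_t *baseptr;
    MPI_Win win;
    MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &baseptr, &win);
    *baseptr = 0;
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 1) {
        /* The "Exclusive lock" and "Shared lock" results below differ only
         * in this lock type (MPI_LOCK_EXCLUSIVE vs. MPI_LOCK_SHARED). */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);

        uint64_t val = 1, res;
        double start = MPI_Wtime();
        for (size_t i = 0; i < NUM_REPS; ++i) {
            MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, 0, 0, MPI_SUM, win);
            MPI_Win_flush(0, win);
        }
        printf("fetch_op: %fus\n", (MPI_Wtime() - start) / NUM_REPS * 1e6);

        MPI_Win_unlock(0, win);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```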
>>> Of particular interest for my use-case is fetch_op, but I am including the other operations here nevertheless:
>>>
>>> * Linux Cluster, IB QDR *
>>> average of 100000 iterations
>>>
>>> Exclusive lock, MPI_UINT32_T:
>>> fetch_op: 4.323384us
>>> compare_exchange: 2.035905us
>>> accumulate: 4.326358us
>>> get_accumulate: 4.334831us
>>>
>>> Exclusive lock, MPI_UINT64_T:
>>> fetch_op: 2.438080us
>>> compare_exchange: 2.398836us
>>> accumulate: 2.435378us
>>> get_accumulate: 2.448347us
>>>
>>> Shared lock, MPI_UINT32_T:
>>> fetch_op: 6.819977us
>>> compare_exchange: 4.551417us
>>> accumulate: 6.807766us
>>> get_accumulate: 6.817602us
>>>
>>> Shared lock, MPI_UINT64_T:
>>> fetch_op: 4.954860us
>>> compare_exchange: 2.399373us
>>> accumulate: 4.965702us
>>> get_accumulate: 4.977876us
>>>
>>> There are two interesting observations:
>>> a) operations on 64-bit operands generally seem to have lower latencies than operations on 32-bit operands
>>> b) using an exclusive lock leads to lower latencies
>>>
>>> Overall, there is a factor of almost 3 between SharedLock+uint32_t and ExclusiveLock+uint64_t for fetch_and_op, accumulate, and get_accumulate (compare_exchange seems to be somewhat of an outlier).
>>>
>>> * Cray XC40, Aries *
>>> average of 100000 iterations
>>>
>>> Exclusive lock, MPI_UINT32_T:
>>> fetch_op: 2.011794us
>>> compare_exchange: 1.740825us
>>> accumulate: 1.795500us
>>> get_accumulate: 1.985409us
>>>
>>> Exclusive lock, MPI_UINT64_T:
>>> fetch_op: 2.017172us
>>> compare_exchange: 1.846202us
>>> accumulate: 1.812578us
>>> get_accumulate: 2.005541us
>>>
>>> Shared lock, MPI_UINT32_T:
>>> fetch_op: 5.380455us
>>> compare_exchange: 5.164458us
>>> accumulate: 5.230184us
>>> get_accumulate: 5.399722us
>>>
>>> Shared lock, MPI_UINT64_T:
>>> fetch_op: 5.415230us
>>> compare_exchange: 1.855840us
>>> accumulate: 5.212632us
>>> get_accumulate: 5.396110us
>>>
>>> The difference between exclusive and shared lock is about the same as with IB, and the latencies for 32-bit vs 64-bit are roughly the same (except for compare_exchange, it seems).
>>>
>>> So my question is: is this to be expected? Is the higher latency when using a shared lock caused by an internal lock being acquired because the hardware operations are not actually atomic?
>>>
>>> I'd be grateful for any insight on this.
>>>
>>> Cheers,
>>> Joseph
>>>
>>> --
>>> Dipl.-Inf. Joseph Schuchart
>>> High Performance Computing Center Stuttgart (HLRS)
>>> Nobelstr. 19
>>> D-70569 Stuttgart
>>>
>>> Tel.: +49(0)711-68565890
>>> Fax: +49(0)711-6856832
>>> E-Mail: schuch...@hlrs.de
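As a point of reference, the two calls compared in point 2) of the May 9 message might be wrapped as in the following sketch. It assumes a window holding a 64-bit value at displacement 0 with a passive-target epoch already open (as in the benchmark sketch earlier), and, per the May 9 message, that OMPI_MCA_osc_rdma_acc_single_intrinsic=true is exported in the environment; the helper names are made up for illustration.

```
#include <mpi.h>
#include <stdint.h>

/* Atomic fetch-and-add: returns the value stored at displacement 0 of
 * `target`'s window before adding `add` (MPI_Fetch_and_op + MPI_SUM). */
static uint64_t fetch_add_u64(MPI_Win win, int target, uint64_t add) {
    uint64_t res;
    MPI_Fetch_and_op(&add, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
    MPI_Win_flush(target, win);
    return res;
}

/* Atomic load: MPI_NO_OP leaves the target value untouched, so only the
 * current contents are fetched; the origin value is ignored. This is the
 * MPI_Fetch_and_op + MPI_NO_OP case whose latency is reported above as
 * roughly 2x that of the MPI_SUM case on 64-bit values. */
static uint64_t load_u64(MPI_Win win, int target) {
    uint64_t ignored = 0, res;
    MPI_Fetch_and_op(&ignored, &res, MPI_UINT64_T, target, 0, MPI_NO_OP, win);
    MPI_Win_flush(target, win);
    return res;
}
```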