I will try to take a look at it today. -Nathan
> On May 9, 2019, at 12:37 AM, Joseph Schuchart via users <users@lists.open-mpi.org> wrote:
>
> Nathan,
>
> Over the last couple of weeks I made some more interesting observations regarding the latencies of accumulate operations on both Aries and InfiniBand systems:
>
> 1) There seems to be a significant difference between 64-bit and 32-bit operations: on Aries, the average latency for compare-exchange on 64-bit values is about 1.8us, while on 32-bit values it is 3.9us, a factor of more than 2x. On the IB cluster, fetch-and-op, compare-exchange, and accumulate all show a similar difference between 32 and 64 bit. There are no differences between 32-bit and 64-bit puts and gets on these systems.
>
> 2) On both systems, the latency for a single-value atomic load using MPI_Fetch_and_op + MPI_NO_OP is 2x that of MPI_Fetch_and_op + MPI_SUM on 64-bit values, roughly matching the latency of 32-bit compare-exchange operations.
>
> All measurements were done using Open MPI 3.1.2 with OMPI_MCA_osc_rdma_acc_single_intrinsic=true. Is that behavior expected as well?
>
> Thanks,
> Joseph
>
> On 11/6/18 6:13 PM, Nathan Hjelm via users wrote:
>> All of this is completely expected. Due to the requirements of the standard it is difficult to make use of network atomics even for MPI_Compare_and_swap (MPI_Accumulate and MPI_Get_accumulate spoil the party). If you want MPI_Fetch_and_op to be fast, set this MCA parameter:
>>
>> osc_rdma_acc_single_intrinsic=true
>>
>> A shared lock is slower than an exclusive lock because there is an extra lock step as part of the accumulate (it isn't needed if there is an exclusive lock). By setting the above parameter you are telling the implementation that you will only be using a single count, and we can optimize that with the hardware. The RMA working group is working on an info key that will essentially do the same thing.
>>
>> Note that the above parameter won't help you with IB if you are using UCX, unless you set this (master only right now):
>>
>> btl_uct_transports=dc_mlx5
>> btl=self,vader,uct
>> osc=^ucx
>>
>> Though there may be a way to get osc/ucx to enable the same sort of optimization. I don't know.
>>
>> -Nathan
>>
>> On Nov 06, 2018, at 09:38 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>> All,
>>>
>>> I am currently experimenting with MPI atomic operations and wanted to share some interesting results I am observing. The numbers below are measurements from both an IB-based cluster and our Cray XC40. The benchmarks look like the following snippet:
>>>
>>> ```
>>> if (rank == 1) {
>>>   uint64_t res, val;
>>>   for (size_t i = 0; i < NUM_REPS; ++i) {
>>>     MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
>>>     MPI_Win_flush(0, win);
>>>   }
>>> }
>>> MPI_Barrier(MPI_COMM_WORLD);
>>> ```
>>>
>>> Only rank 1 performs atomic operations; rank 0 waits in a barrier (I have tried to confirm that the operations are done in hardware by letting rank 0 sleep for a while and ensuring that communication progresses).
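For context, a complete, runnable version of the quoted loop might look like the sketch below. Only the inner loop comes from the original snippet (here with a 64-bit operand); the window allocation, the explicit lock type, and the timing code are assumptions added for illustration.

```
/* Minimal sketch of the fetch_op benchmark: rank 1 repeatedly performs
 * MPI_Fetch_and_op on a 64-bit counter exposed by rank 0 and reports the
 * average latency. Window setup, lock type, and timing are assumptions. */
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_REPS 100000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One 64-bit slot per rank; rank 0 is the target of all operations. */
    uint64_t *baseptr;
    MPI_Win win;
    MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &baseptr, &win);
    *baseptr = 0;
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 1) {
        /* The "Exclusive lock" and "Shared lock" results below differ only
         * in this lock type (MPI_LOCK_EXCLUSIVE vs. MPI_LOCK_SHARED). */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);

        uint64_t val = 1, res;
        double start = MPI_Wtime();
        for (size_t i = 0; i < NUM_REPS; ++i) {
            MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, 0, 0, MPI_SUM, win);
            MPI_Win_flush(0, win);
        }
        printf("fetch_op: %fus\n", (MPI_Wtime() - start) / NUM_REPS * 1e6);

        MPI_Win_unlock(0, win);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```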
>>> Of particular interest for my use-case is fetch_op, but I am including the other operations here nevertheless:
>>>
>>> * Linux Cluster, IB QDR *
>>> average of 100000 iterations
>>>
>>> Exclusive lock, MPI_UINT32_T:
>>> fetch_op: 4.323384us
>>> compare_exchange: 2.035905us
>>> accumulate: 4.326358us
>>> get_accumulate: 4.334831us
>>>
>>> Exclusive lock, MPI_UINT64_T:
>>> fetch_op: 2.438080us
>>> compare_exchange: 2.398836us
>>> accumulate: 2.435378us
>>> get_accumulate: 2.448347us
>>>
>>> Shared lock, MPI_UINT32_T:
>>> fetch_op: 6.819977us
>>> compare_exchange: 4.551417us
>>> accumulate: 6.807766us
>>> get_accumulate: 6.817602us
>>>
>>> Shared lock, MPI_UINT64_T:
>>> fetch_op: 4.954860us
>>> compare_exchange: 2.399373us
>>> accumulate: 4.965702us
>>> get_accumulate: 4.977876us
>>>
>>> There are two interesting observations:
>>> a) operations on 64-bit operands generally seem to have lower latencies than operations on 32-bit operands
>>> b) using an exclusive lock leads to lower latencies
>>>
>>> Overall, there is a factor of almost 3 between SharedLock+uint32_t and ExclusiveLock+uint64_t for fetch_and_op, accumulate, and get_accumulate (compare_exchange seems to be somewhat of an outlier).
>>>
>>> * Cray XC40, Aries *
>>> average of 100000 iterations
>>>
>>> Exclusive lock, MPI_UINT32_T:
>>> fetch_op: 2.011794us
>>> compare_exchange: 1.740825us
>>> accumulate: 1.795500us
>>> get_accumulate: 1.985409us
>>>
>>> Exclusive lock, MPI_UINT64_T:
>>> fetch_op: 2.017172us
>>> compare_exchange: 1.846202us
>>> accumulate: 1.812578us
>>> get_accumulate: 2.005541us
>>>
>>> Shared lock, MPI_UINT32_T:
>>> fetch_op: 5.380455us
>>> compare_exchange: 5.164458us
>>> accumulate: 5.230184us
>>> get_accumulate: 5.399722us
>>>
>>> Shared lock, MPI_UINT64_T:
>>> fetch_op: 5.415230us
>>> compare_exchange: 1.855840us
>>> accumulate: 5.212632us
>>> get_accumulate: 5.396110us
>>>
>>> The difference between exclusive and shared lock is about the same as with IB, and the latencies for 32-bit vs 64-bit are roughly the same (except for compare_exchange, it seems).
>>>
>>> So my question is: is this to be expected? Is the higher latency when using a shared lock caused by an internal lock being acquired because the hardware operations are not actually atomic?
>>>
>>> I'd be grateful for any insight on this.
>>>
>>> Cheers,
>>> Joseph
>>>
>>> --
>>> Dipl.-Inf. Joseph Schuchart
>>> High Performance Computing Center Stuttgart (HLRS)
>>> Nobelstr. 19
>>> D-70569 Stuttgart
>>>
>>> Tel.: +49(0)711-68565890
>>> Fax: +49(0)711-6856832
>>> E-Mail: schuch...@hlrs.de
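As a point of reference, the two calls compared in point 2) of the May 9 message might be wrapped as in the following sketch. It assumes a window holding a 64-bit value at displacement 0 with a passive-target epoch already open (as in the benchmark sketch earlier), and, per the May 9 message, that OMPI_MCA_osc_rdma_acc_single_intrinsic=true is exported in the environment; the helper names are made up for illustration.

```
#include <mpi.h>
#include <stdint.h>

/* Atomic fetch-and-add: returns the value stored at displacement 0 of
 * `target`'s window before adding `add` (MPI_Fetch_and_op + MPI_SUM). */
static uint64_t fetch_add_u64(MPI_Win win, int target, uint64_t add) {
    uint64_t res;
    MPI_Fetch_and_op(&add, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
    MPI_Win_flush(target, win);
    return res;
}

/* Atomic load: MPI_NO_OP leaves the target value untouched, so only the
 * current contents are fetched; the origin value is ignored. This is the
 * MPI_Fetch_and_op + MPI_NO_OP case whose latency is reported above as
 * roughly 2x that of the MPI_SUM case on 64-bit values. */
static uint64_t load_u64(MPI_Win win, int target) {
    uint64_t ignored = 0, res;
    MPI_Fetch_and_op(&ignored, &res, MPI_UINT64_T, target, 0, MPI_NO_OP, win);
    MPI_Win_flush(target, win);
    return res;
}
```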