I did a quick scan of the program and it looks ok to me. I will dig deeper and see if I can determine the underlying cause.
What Open MPI version are you using?
-Nathan
On Nov 08, 2018, at 11:10 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
While using the mca parameter in a real application I noticed a strange
effect, which took me a while to figure out: It appears that on the
Aries network the accumulate operations are not atomic anymore. I am
attaching a test program that shows the problem: all processes but one
continuously increment a counter while rank 0 is continuously
subtracting a large value and adding it again, eventually checking for
the correct number of increments. Without the mca parameter the test at
the end succeeds as all increments are accounted for:
```
$ mpirun -n 16 -N 1 ./mpi_fetch_op_local_remote
result:15000
```
When setting the mca parameter the test fails with garbage in the result:
```
$ mpirun --mca osc_rdma_acc_single_intrinsic true -n 16 -N 1 ./mpi_fetch_op_local_remote
result:25769849013
mpi_fetch_op_local_remote: mpi_fetch_op_local_remote.c:97: main: Assertion `sum == 1000*(comm_size-1)' failed.
```
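For quick reference, the failing pattern boils down to the condensed sketch below. It is a simplified, standalone variant of the attached program (it drops the window info keys and the MPI_Ibarrier-based termination and just runs a fixed number of subtract/re-add rounds on rank 0), so the structure and identifiers are illustrative rather than the exact attached code:
```
/* Condensed sketch of the failure pattern: writers atomically increment a
 * 64-bit counter at rank 0 while rank 0 repeatedly subtracts a large value
 * and adds it back. Simplified from the attached test; not the exact code. */
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>
#include <assert.h>

#define NUM_ITER 1000

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int64_t *baseptr;
  MPI_Win win;
  MPI_Win_allocate(sizeof(int64_t), 1, MPI_INFO_NULL, MPI_COMM_WORLD,
                   &baseptr, &win);
  MPI_Win_lock_all(0, win);
  if (rank == 0) *baseptr = 0;   // the shared counter lives at rank 0
  MPI_Barrier(MPI_COMM_WORLD);

  const int64_t one = 1, big = (int64_t)UINT32_MAX;
  int64_t fetched;
  if (rank > 0) {
    // writers: NUM_ITER atomic increments of the counter at rank 0
    for (int i = 0; i < NUM_ITER; ++i) {
      MPI_Fetch_and_op(&one, &fetched, MPI_INT64_T, 0, 0, MPI_SUM, win);
      MPI_Win_flush(0, win);
    }
  } else {
    // rank 0: repeatedly subtract a large value and add it back;
    // if every fetch-and-op is atomic, no increments can be lost
    for (int i = 0; i < NUM_ITER; ++i) {
      int64_t down = -big, up = big;
      MPI_Fetch_and_op(&down, &fetched, MPI_INT64_T, 0, 0, MPI_SUM, win);
      MPI_Win_flush(0, win);
      MPI_Fetch_and_op(&up, &fetched, MPI_INT64_T, 0, 0, MPI_SUM, win);
      MPI_Win_flush(0, win);
    }
  }
  MPI_Barrier(MPI_COMM_WORLD);

  if (rank == 0) {
    int64_t sum;
    MPI_Fetch_and_op(NULL, &sum, MPI_INT64_T, 0, 0, MPI_NO_OP, win);
    MPI_Win_flush(0, win);
    printf("result:%lld\n", (long long)sum);
    assert(sum == (int64_t)NUM_ITER * (size - 1));
  }

  MPI_Win_unlock_all(win);
  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}
```
If every MPI_Fetch_and_op is applied atomically, rank 0's subtract/re-add pairs cancel out and the final value equals NUM_ITER*(comm_size-1); with osc_rdma_acc_single_intrinsic=true on Aries that is apparently no longer the case.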
All processes perform only MPI_Fetch_and_op in combination with MPI_SUM,
so I assume that the test remains correct even with the mca flag set. I
cannot reproduce this issue on our IB cluster.
Is that an issue in Open MPI or is there some problem in the test case
that I am missing?
Thanks in advance,
Joseph
On 11/6/18 1:15 PM, Joseph Schuchart wrote:
Thanks a lot for the quick reply, setting osc_rdma_acc_single_intrinsic=true does the trick for both shared and exclusive locks and brings it down to <2us per operation. I hope that the info key will make it into the next version of the standard, I certainly have use for it :)
Cheers,
Joseph

On 11/6/18 12:13 PM, Nathan Hjelm via users wrote:
All of this is completely expected. Due to the requirements of the standard it is difficult to make use of network atomics even for MPI_Compare_and_swap (MPI_Accumulate and MPI_Get_accumulate spoil the party). If you want MPI_Fetch_and_op to be fast set this MCA parameter:
osc_rdma_acc_single_intrinsic=true
Shared lock is slower than an exclusive lock because there is an extra lock step as part of the accumulate (it isn't needed if there is an exclusive lock). When setting the above parameter you are telling the implementation that you will only be using a single count and we can optimize that with the hardware. The RMA working group is working on an info key that will essentially do the same thing.
Note the above parameter won't help you with IB if you are using UCX unless you set this (master only right now):
btl_uct_transports=dc_mlx5
btl=self,vader,uct
osc=^ucx
Though there may be a way to get osc/ucx to enable the same sort of optimization. I don't know.
-Nathan

On Nov 06, 2018, at 09:38 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
All,
I am currently experimenting with MPI atomic operations and wanted to share some interesting results I am observing. The numbers below are measurements from both an IB-based cluster and our Cray XC40. The benchmarks look like the following snippet:
```
if (rank == 1) {
  uint64_t res, val;
  for (size_t i = 0; i < NUM_REPS; ++i) {
    MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
    MPI_Win_flush(target, win);
  }
}
MPI_Barrier(MPI_COMM_WORLD);
```
Only rank 1 performs atomic operations, rank 0 waits in a barrier (I have tried to confirm that the operations are done in hardware by letting rank 0 sleep for a while and ensuring that communication progresses). Of particular interest for my use-case is fetch_op but I am including other operations here nevertheless:

* Linux Cluster, IB QDR *
average of 100000 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 4.323384us
compare_exchange: 2.035905us
accumulate: 4.326358us
get_accumulate: 4.334831us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.438080us
compare_exchange: 2.398836us
accumulate: 2.435378us
get_accumulate: 2.448347us

Shared lock, MPI_UINT32_T:
fetch_op: 6.819977us
compare_exchange: 4.551417us
accumulate: 6.807766us
get_accumulate: 6.817602us

Shared lock, MPI_UINT64_T:
fetch_op: 4.954860us
compare_exchange: 2.399373us
accumulate: 4.965702us
get_accumulate: 4.977876us

There are two interesting observations:
a) operations on 64bit operands generally seem to have lower latencies than operations on 32bit
b) Using an exclusive lock leads to lower latencies
Overall, there is a factor of almost 3 between SharedLock+uint32_t and ExclusiveLock+uint64_t for fetch_and_op, accumulate, and get_accumulate (compare_exchange seems to be somewhat of an outlier).

* Cray XC40, Aries *
average of 100000 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 2.011794us
compare_exchange: 1.740825us
accumulate: 1.795500us
get_accumulate: 1.985409us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.017172us
compare_exchange: 1.846202us
accumulate: 1.812578us
get_accumulate: 2.005541us

Shared lock, MPI_UINT32_T:
fetch_op: 5.380455us
compare_exchange: 5.164458us
accumulate: 5.230184us
get_accumulate: 5.399722us

Shared lock, MPI_UINT64_T:
fetch_op: 5.415230us
compare_exchange: 1.855840us
accumulate: 5.212632us
get_accumulate: 5.396110us

The difference between exclusive and shared lock is about the same as with IB and the latencies for 32bit vs 64bit are roughly the same (except for compare_exchange, it seems).
So my question is: is this to be expected? Is the higher latency when using a shared lock caused by an internal lock being acquired because the hardware operations are not actually atomic?
I'd be grateful for any insight on this.
Cheers,
Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart
Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
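As a side note on the shared- vs. exclusive-lock comparison quoted above: the per-operation latencies were presumably obtained by timing a flushed fetch-and-op loop under each lock type, roughly as in the sketch below. The MPI_Wtime-based timing, the window setup, and all identifiers here are my reconstruction, not the original benchmark code:
```
/* Hypothetical reconstruction of the quoted latency measurement: time a
 * flushed MPI_Fetch_and_op loop under an exclusive vs. a shared
 * passive-target lock. Details are assumptions, not the original code. */
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_REPS 100000

static void bench(MPI_Win win, int target, int lock_type, const char *label)
{
  uint64_t val = 1, res;
  MPI_Win_lock(lock_type, target, 0, win);   // MPI_LOCK_EXCLUSIVE or MPI_LOCK_SHARED
  double t0 = MPI_Wtime();
  for (int i = 0; i < NUM_REPS; ++i) {
    MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
    MPI_Win_flush(target, win);              // complete each op before timing the next
  }
  double t1 = MPI_Wtime();
  MPI_Win_unlock(target, win);
  printf("%s fetch_op: %fus\n", label, (t1 - t0) / NUM_REPS * 1e6);
}

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  uint64_t *baseptr;
  MPI_Win win;
  MPI_Win_allocate(sizeof(uint64_t), 1, MPI_INFO_NULL, MPI_COMM_WORLD,
                   &baseptr, &win);

  // as in the quoted snippet, only rank 1 issues atomics against rank 0
  if (rank == 1) {
    bench(win, 0, MPI_LOCK_EXCLUSIVE, "Exclusive lock,");
    bench(win, 0, MPI_LOCK_SHARED,    "Shared lock,");
  }
  MPI_Barrier(MPI_COMM_WORLD);

  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}
```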
The attached test program (mpi_fetch_op_local_remote.c):
```
#include <mpi.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>
#include <unistd.h>

#define NUM_ITER 1000

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  void *baseptr;
  MPI_Win win;
  int comm_size;
  int comm_rank;
  const int64_t one  = 1;
  const int64_t mone = -one;

  MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
  MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);

  // a single value that is atomically updated by all processes
  int win_size = sizeof(int64_t);
  MPI_Info win_info;
  MPI_Info_create(&win_info);
  MPI_Info_set(win_info, "accumulate_ordering", "none");
  MPI_Info_set(win_info, "same_size",           "true");
  MPI_Info_set(win_info, "same_disp_unit",      "true");
  MPI_Info_set(win_info, "accumulate_ops",      "same_op_no_op");
  MPI_Win_allocate(win_size, 1, win_info, MPI_COMM_WORLD, &baseptr, &win);
  MPI_Info_free(&win_info);

  MPI_Win_lock_all(0, win);
  memset(baseptr, 0, win_size);
  MPI_Barrier(MPI_COMM_WORLD);

  if (comm_rank > 0) {
    for (int i = 0; i < NUM_ITER; ++i) {
      int64_t result;
      // increment by one
      MPI_Fetch_and_op(&one, &result, MPI_INT64_T, 0, 0, MPI_SUM, win);
      MPI_Win_flush(0, win);
    }
    // signal completion
    MPI_Request req;
    MPI_Ibarrier(MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
  } else {
    int flag;
    int64_t sum = 0;
    const int64_t neg_value = -((int64_t)UINT32_MAX);
    MPI_Request req;
    MPI_Ibarrier(MPI_COMM_WORLD, &req);
    do {
      int64_t value;
      int64_t update = neg_value;
      // fetch value and set to large negative value
      MPI_Fetch_and_op(&update, &value, MPI_INT64_T, 0, 0, MPI_SUM, win);
      MPI_Win_flush(0, win);
      //printf("value: %ld\n", value);
      // the value should be positive as we have reset it in the previous iteration
      // Note: this assert triggers on Cray XC40
      //assert(value >= 0);
      // reset
      update = -neg_value;
      MPI_Fetch_and_op(&update, &value, MPI_INT64_T, 0, 0, MPI_SUM, win);
      MPI_Win_flush(0, win);
      // check for barrier to complete
      MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
    } while (flag == 0);

    // read the final value
    MPI_Fetch_and_op(NULL, &sum, MPI_INT64_T, 0, 0, MPI_NO_OP, win);
    MPI_Win_flush(0, win);
    printf("result:%ld\n", sum);
    assert(sum == NUM_ITER*(comm_size-1));
  }

  MPI_Win_unlock_all(win);
  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}
```
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users