Re: [OMPI users] Latencies of atomic operations on high-performance networks
> On May 9, 2019, at 12:37 AM, Joseph Schuchart via users wrote:
>
> Nathan,
>
> Over the last couple of weeks I made some more interesting observations regarding the latencies of accumulate operations on both Aries and InfiniBand systems:
>
> 1) There seems to be a significant difference between 64bit and 32bit operations: on Aries, the average latency for compare-exchange on 64bit values takes about 1.8us while on 32bit values it's at 3.9us, a factor of >2x. On the IB cluster, all of fetch-and-op, compare-exchange, and accumulate show a similar difference between 32 and 64bit. There are no differences between 32bit and 64bit puts and gets on these systems.

1) On Aries, 32-bit and 64-bit CAS operations should have similar performance. This looks like a bug and I will try to track it down now.

2) On InfiniBand, when using verbs we only have access to 64-bit atomic memory operations (a limitation of the now-dead btl/openib component). I think there may be support in UCX for 32-bit AMOs, but that support is not implemented in Open MPI (at least not in btl/uct). I can take a look at btl/uct and see what I find.

> 2) On both systems, the latency for a single-value atomic load using MPI_Fetch_and_op + MPI_NO_OP is 2x that of MPI_Fetch_and_op + MPI_SUM on 64bit values, roughly matching the latency of 32bit compare-exchange operations.

This is expected given the current implementation. When doing MPI_NO_OP it falls back to lock + get. I suppose I can change it to use MPI_SUM with an operand of 0. Will investigate.

-Nathan
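[Editor's note: for reference, a minimal sketch of the two atomic-load variants discussed above, assuming a 64-bit integer window created and locked elsewhere; the function names, window, and target rank are placeholders, not from the original thread.]

```c
#include <mpi.h>
#include <stdint.h>

/* Atomic load via MPI_NO_OP: per Nathan's explanation, the current
 * implementation falls back to lock + get on this path. */
uint64_t atomic_load_noop(MPI_Win win, int target)
{
    uint64_t res, dummy = 0;  /* origin buffer is not significant for MPI_NO_OP */
    MPI_Fetch_and_op(&dummy, &res, MPI_UINT64_T, target, 0, MPI_NO_OP, win);
    MPI_Win_flush(target, win);
    return res;
}

/* The workaround Nathan suggests: a fetch-and-add of 0 with MPI_SUM
 * reads the same value but can map to a hardware fetch-and-add. */
uint64_t atomic_load_sum_zero(MPI_Win win, int target)
{
    uint64_t res, zero = 0;
    MPI_Fetch_and_op(&zero, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
    MPI_Win_flush(target, win);
    return res;
}
```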
Re: [OMPI users] MPI failing on Infiniband (queue pair error)
You might want to try two things:

1. Upgrade to Open MPI v4.0.1.

2. Use the UCX PML instead of the openib BTL. You may need to download/install UCX first. Then configure Open MPI:

   ./configure --with-ucx --without-verbs --enable-mca-no-build=btl-uct ...

This will build the UCX PML, and that should get used by default when you mpirun.

Note that the "--enable-mca-no-build..." option is needed because it looks like we have a plugin (the BTL UCT plugin, to be specific) in the v4.0.1 release that does not compile successfully with the latest version of UCX. This will be fixed in a subsequent Open MPI v4.0.x release.

> On May 9, 2019, at 10:17 AM, Koutsoukos Dimitrios via users wrote:
> [...]

--
Jeff Squyres
jsquy...@cisco.com
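[Editor's note: something like the following sequence would exercise Jeff's suggestion end to end. This is a sketch rather than a tested recipe; the install prefix, process count, and executable name are placeholders. `--mca pml ucx` forces the UCX PML, and `pml_base_verbose` prints which PML is actually selected at startup.]

```sh
./configure --prefix=$HOME/ompi-4.0.1 --with-ucx --without-verbs \
            --enable-mca-no-build=btl-uct
make -j8 all install

# Verify that the UCX PML is really being picked up:
mpirun --mca pml ucx --mca pml_base_verbose 10 -np 2 ./executable_name
```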
Re: [OMPI users] Problems in 4.0.1 version and printf function
Stdout forwarding should continue to work in v4.0.x just like it did in v3.0.x. I.e., printf's from your app should appear in the stdout of mpirun.

Sometimes they can get buffered, however, such as if you redirect the stdout to a file or to a pipe. Such shell buffering may only emit output when it sees \n's, or a buffer-full at a time (e.g., 4K?). For example (typed directly in my mail client -- YMMV):

---
$ cat foo.c
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    MPI_Init(NULL, NULL);
    printf("Hello, world\n");
    sleep(60);
    MPI_Finalize();
    return 0;
}
$ mpicc foo.c -o foo
$ mpirun -np 1 foo
Hello, world
<...waits 60 seconds...>
$ mpirun -np 1 foo | tee out.txt
<...waits 60 seconds...>
Hello, world
$
---

The latter run may not emit anything for 60 seconds because the single line of "Hello, world" may not have been an entire buffer-full, so the pipe chose not to emit it until a) it fills its buffer, or b) the program completes.

Buffering like that depends very much on how the buffering is implemented (e.g., by the shell), so it may be outside of Open MPI's control.

> On May 8, 2019, at 9:02 PM, Nilton Luiz Queiroz Junior via users wrote:
>
> I upgraded my MPI from version 3.0.0 to 4.0.1, and now when I compile (with mpicc) and run (with mpirun) any .c source file that uses the printf function from stdio, I don't get the print results in stdout. Does anyone know what it can be?
>
> I'm very thankful for your attention.
>
> Nilton.

--
Jeff Squyres
jsquy...@cisco.com
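[Editor's note: if application-side stdio buffering turns out to be the culprit (libc block-buffers stdout when it is not a terminal), flushing explicitly from the program is a cheap test. A sketch, not from the original thread:]

```c
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(NULL, NULL);
    /* Disable stdio buffering entirely; alternatively, call
     * fflush(stdout) after each printf. Note this only addresses the
     * application's own buffer -- buffering in a downstream pipe
     * (e.g., tee) is outside the program's control. */
    setvbuf(stdout, NULL, _IONBF, 0);
    printf("Hello, world\n");
    sleep(60);
    MPI_Finalize();
    return 0;
}
```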
[OMPI users] MPI failing on Infiniband (queue pair error)
Hi all,

I am trying to run MPI in distributed mode. The setup is an 8-machine cluster with Debian 8 (Jessie), Intel Xeon E5-2609 2.40 GHz CPUs, and Mellanox QDR HCA InfiniBand. My Open MPI version is 3.0.4. I can successfully run a simple command on all nodes that doesn't use the InfiniBand, but when I run my experiments I receive the following error from one of the nodes:

--------------------------------------------------------------------------
Failed to modify the attributes of a queue pair (QP):

Hostname: euler04
Mask for QP attributes to be modified: 65537
Error: Invalid argument
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly. This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: euler04
--------------------------------------------------------------------------
Failed to modify the attributes of a queue pair (QP):

Hostname: euler04
Mask for QP attributes to be modified: 65537
Error: Invalid argument
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly. This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: euler04
--------------------------------------------------------------------------
[euler04][[29717,1],29][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp] error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],25][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp] error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],24][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp] error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],31][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp] error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],30][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp] error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],27][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp] error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],26][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp] error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],28][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp] error modifing QP to RTS errno says Invalid argument; errno=22

Note that I am compiling Open MPI from source on a shared NFS mount using the commands:

./configure --prefix=/path/to/NFS/
make
make install

Also, my cluster configuration is the same on all of the nodes. I am running my job using:

/path/to/NFS/mpirun --hostfile hostfile ./executable_name

I do not receive any error when I exclude this host. Is this a hardware error? Should I try a different MPI version? Any help would be appreciated.

Thanks very much in advance for your help,
Dimitris
Re: [OMPI users] Latencies of atomic operations on high-performance networks
I will try to take a look at it today.

-Nathan

> On May 9, 2019, at 12:37 AM, Joseph Schuchart via users wrote:
> [...]
Re: [OMPI users] Latencies of atomic operations on high-performance networks
Benson,

I just gave 4.0.1 a shot and the behavior is the same. (The reason I'm stuck with 3.1.2 is a regression with `osc_rdma_acc_single_intrinsic` on 4.0 [1].) The IB cluster has both Mellanox ConnectX-3 (w/ Haswell CPU) and ConnectX-4 (w/ Skylake CPU) nodes; the effect is visible on both node types.

Joseph

[1] https://github.com/open-mpi/ompi/issues/6536

On 5/9/19 9:10 AM, Benson Muite via users wrote:
> Hi,
>
> Have you tried anything with OpenMPI 4.0.1?
>
> What are the specifications of the Infiniband system you are using?
>
> Benson
>
> On 5/9/19 9:37 AM, Joseph Schuchart via users wrote:
>> [...]
Re: [OMPI users] Latencies of atomic operations on high-performance networks
Hi,

Have you tried anything with OpenMPI 4.0.1?

What are the specifications of the Infiniband system you are using?

Benson

On 5/9/19 9:37 AM, Joseph Schuchart via users wrote:
> Nathan,
>
> Over the last couple of weeks I made some more interesting observations regarding the latencies of accumulate operations on both Aries and InfiniBand systems:
>
> 1) There seems to be a significant difference between 64bit and 32bit operations: on Aries, the average latency for compare-exchange on 64bit values takes about 1.8us while on 32bit values it's at 3.9us, a factor of >2x. On the IB cluster, all of fetch-and-op, compare-exchange, and accumulate show a similar difference between 32 and 64bit. There are no differences between 32bit and 64bit puts and gets on these systems.
>
> 2) On both systems, the latency for a single-value atomic load using MPI_Fetch_and_op + MPI_NO_OP is 2x that of MPI_Fetch_and_op + MPI_SUM on 64bit values, roughly matching the latency of 32bit compare-exchange operations.
>
> All measurements were done using Open MPI 3.1.2 with OMPI_MCA_osc_rdma_acc_single_intrinsic=true. Is that behavior expected as well?
>
> Thanks,
> Joseph
>
> On 11/6/18 6:13 PM, Nathan Hjelm via users wrote:
>> All of this is completely expected. Due to the requirements of the standard it is difficult to make use of network atomics even for MPI_Compare_and_swap (MPI_Accumulate and MPI_Get_accumulate spoil the party). If you want MPI_Fetch_and_op to be fast, set this MCA parameter:
>>
>>   osc_rdma_acc_single_intrinsic=true
>>
>> Shared lock is slower than an exclusive lock because there is an extra lock step as part of the accumulate (it isn't needed if there is an exclusive lock). When setting the above parameter you are telling the implementation that you will only be using a single count and we can optimize that with the hardware. The RMA working group is working on an info key that will essentially do the same thing.
>>
>> Note the above parameter won't help you with IB if you are using UCX unless you set this (master only right now):
>>
>>   btl_uct_transports=dc_mlx5
>>   btl=self,vader,uct
>>   osc=^ucx
>>
>> Though there may be a way to get osc/ucx to enable the same sort of optimization. I don't know.
>>
>> -Nathan
>>
>> On Nov 06, 2018, at 09:38 AM, Joseph Schuchart wrote:
>>> All,
>>>
>>> I am currently experimenting with MPI atomic operations and wanted to share some interesting results I am observing. The numbers below are measurements from both an IB-based cluster and our Cray XC40. The benchmarks look like the following snippet:
>>>
>>> ```
>>> if (rank == 1) {
>>>   uint64_t res, val;
>>>   for (size_t i = 0; i < NUM_REPS; ++i) {
>>>     MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
>>>     MPI_Win_flush(target, win);
>>>   }
>>> }
>>> MPI_Barrier(MPI_COMM_WORLD);
>>> ```
>>>
>>> Only rank 1 performs atomic operations, rank 0 waits in a barrier (I have tried to confirm that the operations are done in hardware by letting rank 0 sleep for a while and ensuring that communication progresses). Of particular interest for my use-case is fetch_op but I am including other operations here nevertheless:
>>>
>>> * Linux Cluster, IB QDR *
>>> average of 10 iterations
>>>
>>> Exclusive lock, MPI_UINT32_T:
>>> fetch_op: 4.323384us
>>> compare_exchange: 2.035905us
>>> accumulate: 4.326358us
>>> get_accumulate: 4.334831us
>>>
>>> Exclusive lock, MPI_UINT64_T:
>>> fetch_op: 2.438080us
>>> compare_exchange: 2.398836us
>>> accumulate: 2.435378us
>>> get_accumulate: 2.448347us
>>>
>>> Shared lock, MPI_UINT32_T:
>>> fetch_op: 6.819977us
>>> compare_exchange: 4.551417us
>>> accumulate: 6.807766us
>>> get_accumulate: 6.817602us
>>>
>>> Shared lock, MPI_UINT64_T:
>>> fetch_op: 4.954860us
>>> compare_exchange: 2.399373us
>>> accumulate: 4.965702us
>>> get_accumulate: 4.977876us
>>>
>>> There are two interesting observations:
>>> a) operations on 64bit operands generally seem to have lower latencies than operations on 32bit
>>> b) Using an exclusive lock leads to lower latencies
>>>
>>> Overall, there is a factor of almost 3 between SharedLock+uint32_t and ExclusiveLock+uint64_t for fetch_and_op, accumulate, and get_accumulate (compare_exchange seems to be somewhat of an outlier).
>>>
>>> * Cray XC40, Aries *
>>> average of 10 iterations
>>>
>>> Exclusive lock, MPI_UINT32_T:
>>> fetch_op: 2.011794us
>>> compare_exchange: 1.740825us
>>> accumulate: 1.795500us
>>> get_accumulate: 1.985409us
>>>
>>> Exclusive lock, MPI_UINT64_T:
>>> fetch_op: 2.017172us
>>> compare_exchange: 1.846202us
>>> accumulate: 1.812578us
>>> get_accumulate: 2.005541us
>>>
>>> Shared lock, MPI_UINT32_T:
>>> fetch_op: 5.380455us
>>> compare_exchange: 5.164458us
>>> accumulate: 5.230184us
>>> get_accumulate: 5.399722us
>>>
>>> Shared lock, MPI_UINT64_T:
>>> fetch_op: 5.415230us
>>> compare_exchange: 1.855840us
>>> accumulate: 5.212632us
>>> get_accumulate: 5.396110us
>>>
>>> The difference between exclusive and shared lock is about the same as with IB and the latencies for 32bit vs 64bit are roughly the same (except for compare_exchange, it seems). So my question is: is this to be expected? Is the higher latency when using a shared lock caused by an internal lock being acquired because the hardware operations are not actually atomic? I'd be grateful for any insight on this.
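[Editor's note: to make the two benchmark configurations above concrete, a minimal sketch of the lock modes being compared. Window creation and the OMPI_MCA_osc_rdma_acc_single_intrinsic=true setting are assumed to happen elsewhere; the function name, window, and target are placeholders.]

```c
#include <mpi.h>
#include <stdint.h>

/* One fetch-and-add under the given lock type; call with
 * MPI_LOCK_EXCLUSIVE or MPI_LOCK_SHARED to reproduce the two
 * rows of the tables above. */
uint64_t fetch_add_locked(MPI_Win win, int target, int lock_type)
{
    uint64_t res, val = 1;
    MPI_Win_lock(lock_type, target, 0, win);
    MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
    MPI_Win_flush(target, win);
    MPI_Win_unlock(target, win);
    return res;
}

/* usage: fetch_add_locked(win, 0, MPI_LOCK_EXCLUSIVE);
 *        fetch_add_locked(win, 0, MPI_LOCK_SHARED); */
```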