Re: [OMPI users] Latencies of atomic operations on high-performance networks

2019-05-09 Thread Nathan Hjelm via users

> On May 9, 2019, at 12:37 AM, Joseph Schuchart via users wrote:
> 
> Nathan,
> 
> Over the last couple of weeks I made some more interesting observations 
> regarding the latencies of accumulate operations on both Aries and InfiniBand 
> systems:
> 
> 1) There seems to be a significant difference between 64-bit and 32-bit 
> operations: on Aries, the average latency for compare-exchange on 64-bit 
> values is about 1.8us while on 32-bit values it is 3.9us, a factor of 
> more than 2x. On the IB cluster, all of fetch-and-op, compare-exchange, and 
> accumulate show a similar difference between 32 and 64 bit. There are no 
> differences between 32-bit and 64-bit puts and gets on these systems.


1) On Aries, 32-bit and 64-bit CAS operations should have similar performance. 
This looks like a bug and I will try to track it down now.

2) On InfiniBand, when using verbs we only have access to 64-bit atomic memory 
operations (a limitation of the now-dead btl/openib component). I think there may 
be support in UCX for 32-bit AMOs, but that support is not implemented in Open 
MPI (at least not in btl/uct). I can take a look at btl/uct and see what I find.

> 2) On both systems, the latency for a single-value atomic load using 
> MPI_Fetch_and_op + MPI_NO_OP is 2x that of MPI_Fetch_and_op + MPI_SUM on 
> 64-bit values, roughly matching the latency of 32-bit compare-exchange 
> operations.

This is expected given the current implementation. When using MPI_NO_OP it 
falls back to lock + get. I suppose I can change it to use MPI_SUM with an 
operand of 0. Will investigate.
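
For illustration, the two approaches reduce to calls like these (a minimal 
sketch, not from the original mail; it assumes a valid window `win` and a 
target rank `target` exposing a 64-bit value at displacement 0):

```
#include <mpi.h>
#include <stdint.h>

/* (a) portable atomic load: with MPI_NO_OP the origin buffer is not used */
uint64_t load_noop(MPI_Win win, int target) {
    uint64_t res, dummy = 0;
    MPI_Fetch_and_op(&dummy, &res, MPI_UINT64_T, target, 0, MPI_NO_OP, win);
    MPI_Win_flush(target, win);
    return res;
}

/* (b) proposed alternative: MPI_SUM with an operand of 0 leaves the target
 * value unchanged and returns the previous (i.e. current) value */
uint64_t load_sum_zero(MPI_Win win, int target) {
    uint64_t res, zero = 0;
    MPI_Fetch_and_op(&zero, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
    MPI_Win_flush(target, win);
    return res;
}
```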


-Nathan


Re: [OMPI users] Latencies of atomic operations on high-performance networks

2019-05-09 Thread Nathan Hjelm via users
I will try to take a look at it today.

-Nathan

> On May 9, 2019, at 12:37 AM, Joseph Schuchart via users wrote:
> [...]

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2019-05-09 Thread Joseph Schuchart via users

Benson,

I just gave 4.0.1 a shot and the behavior is the same (the reason I'm 
stuck with 3.1.2 is a regression with `osc_rdma_acc_single_intrinsic` on 
4.0 [1]).


The IB cluster has both Mellanox ConnectX-3 (w/ Haswell CPUs) and 
ConnectX-4 (w/ Skylake CPUs) nodes; the effect is visible on both node types.


Joseph

[1] https://github.com/open-mpi/ompi/issues/6536

On 5/9/19 9:10 AM, Benson Muite via users wrote:
[...]

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2019-05-09 Thread Benson Muite via users

Hi,

Have you tried anything with Open MPI 4.0.1?

What are the specifications of the InfiniBand system you are using?

Benson

On 5/9/19 9:37 AM, Joseph Schuchart via users wrote:
[...]
Re: [OMPI users] Latencies of atomic operations on high-performance networks

2019-05-09 Thread Joseph Schuchart via users

Nathan,

Over the last couple of weeks I made some more interesting observations 
regarding the latencies of accumulate operations on both Aries and 
InfiniBand systems:


1) There seems to be a significant difference between 64-bit and 32-bit 
operations: on Aries, the average latency for compare-exchange on 64-bit 
values is about 1.8us while on 32-bit values it is 3.9us, a factor 
of more than 2x. On the IB cluster, all of fetch-and-op, compare-exchange, and 
accumulate show a similar difference between 32 and 64 bit. There are no 
differences between 32-bit and 64-bit puts and gets on these systems.
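
For reference, the two operations being compared reduce to calls like these 
(a sketch, assuming a valid window `win` exposing the value at displacement 0; 
names are illustrative):

```
#include <mpi.h>
#include <stdint.h>

/* 32-bit compare-exchange (the slow case on Aries per the numbers above) */
void cas32(MPI_Win win, int target) {
    uint32_t origin = 1, compare = 0, result;
    MPI_Compare_and_swap(&origin, &compare, &result, MPI_UINT32_T,
                         target, 0, win);
    MPI_Win_flush(target, win);
}

/* 64-bit compare-exchange (roughly 2x faster on Aries per the numbers above) */
void cas64(MPI_Win win, int target) {
    uint64_t origin = 1, compare = 0, result;
    MPI_Compare_and_swap(&origin, &compare, &result, MPI_UINT64_T,
                         target, 0, win);
    MPI_Win_flush(target, win);
}
```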


2) On both systems, the latency for a single-value atomic load using 
MPI_Fetch_and_op + MPI_NO_OP is 2x that of MPI_Fetch_and_op + MPI_SUM on 
64-bit values, roughly matching the latency of 32-bit compare-exchange 
operations.


All measurements were done using Open MPI 3.1.2 with 
OMPI_MCA_osc_rdma_acc_single_intrinsic=true. Is that behavior expected 
as well?


Thanks,
Joseph


On 11/6/18 6:13 PM, Nathan Hjelm via users wrote:
[...]

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-08 Thread Nathan Hjelm via users

Ok, then it sounds like a regression. I will try to track it down today or 
tomorrow.


-Nathan

On Nov 08, 2018, at 01:41 PM, Joseph Schuchart wrote:
[...]
Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-08 Thread Joseph Schuchart
Sorry for the delay; I wanted to make sure that I test the same version 
on both Aries and IB: git master bbe5da4. I realized that I had 
previously tested with 3.1.3 on the IB cluster, which ran fine. If I use 
the same version I run into the same problem on both systems (with --mca 
btl_openib_allow_ib true --mca osc_rdma_acc_single_intrinsic true). I 
have not tried using UCX for this.
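
For reference, the combined invocation would look something like this 
(reconstructed from the flags above and the earlier test run, not copied 
verbatim from the thread):

```
$ mpirun --mca btl_openib_allow_ib true \
    --mca osc_rdma_acc_single_intrinsic true \
    -n 16 -N 1 ./mpi_fetch_op_local_remote
```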


Joseph

On 11/8/18 1:20 PM, Nathan Hjelm via users wrote:
[...]

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-08 Thread Nathan Hjelm via users
Quick scan of the program and it looks ok to me. I will dig deeper and 
see if I can determine the underlying cause.

What Open MPI version are you using?

-Nathan

On Nov 08, 2018, at 11:10 AM, Joseph Schuchart wrote:
[...]

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-08 Thread Joseph Schuchart
While using the mca parameter in a real application I noticed a strange 
effect, which took me a while to figure out: it appears that on the 
Aries network the accumulate operations are not atomic anymore. I am 
attaching a test program that shows the problem: all but one process 
continuously increment a counter while rank 0 continuously 
subtracts a large value and adds it back, eventually checking for 
the correct number of increments. Without the mca parameter the test at 
the end succeeds, as all increments are accounted for:


```
$ mpirun -n 16 -N 1 ./mpi_fetch_op_local_remote
result:15000
```

When setting the mca parameter the test fails with garbage in the result:

```
$ mpirun --mca osc_rdma_acc_single_intrinsic true -n 16 -N 1 \
    ./mpi_fetch_op_local_remote
result:25769849013
mpi_fetch_op_local_remote: mpi_fetch_op_local_remote.c:97: main: 
Assertion `sum == 1000*(comm_size-1)' failed.
```

All processes perform only MPI_Fetch_and_op in combination with MPI_SUM 
so I assume that the test in combination with the mca flag is correct. I 
cannot reproduce this issue on our IB cluster.
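
The attached test program itself is not preserved in this archive; based on 
the description above and the assertion in the output, its core logic might 
look roughly like this sketch (window setup, the large value, and the helper 
names are assumptions, not the actual attachment):

```
#include <mpi.h>
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <assert.h>

#define NUM_INCS 1000  /* matches the 1000*(comm_size-1) assertion above */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, comm_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);

    uint64_t *baseptr;
    MPI_Win win;
    MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &baseptr, &win);
    *baseptr = 0;
    MPI_Win_lock_all(0, win);
    MPI_Barrier(MPI_COMM_WORLD);

    uint64_t res;
    if (rank != 0) {
        /* all but one process: atomically increment the counter on rank 0 */
        uint64_t one = 1;
        for (int i = 0; i < NUM_INCS; ++i) {
            MPI_Fetch_and_op(&one, &res, MPI_UINT64_T, 0, 0, MPI_SUM, win);
            MPI_Win_flush(0, win);
        }
    } else {
        /* rank 0: repeatedly subtract a large value and add it back */
        const uint64_t big = UINT64_C(1) << 40;
        uint64_t neg_big = (uint64_t)0 - big; /* unsigned wrap == subtraction */
        for (int i = 0; i < NUM_INCS; ++i) {
            MPI_Fetch_and_op(&neg_big, &res, MPI_UINT64_T, 0, 0, MPI_SUM, win);
            MPI_Win_flush(0, win);
            MPI_Fetch_and_op(&big, &res, MPI_UINT64_T, 0, 0, MPI_SUM, win);
            MPI_Win_flush(0, win);
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0) {
        /* atomically read the final value and verify all increments arrived */
        uint64_t dummy = 0, sum;
        MPI_Fetch_and_op(&dummy, &sum, MPI_UINT64_T, 0, 0, MPI_NO_OP, win);
        MPI_Win_flush(0, win);
        printf("result:%" PRIu64 "\n", sum);
        assert(sum == (uint64_t)NUM_INCS * (comm_size - 1));
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```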


Is that an issue in Open MPI or is there some problem in the test case 
that I am missing?


Thanks in advance,
Joseph


On 11/6/18 1:15 PM, Joseph Schuchart wrote:
[...]

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-06 Thread Joseph Schuchart
Thanks a lot for the quick reply; setting 
osc_rdma_acc_single_intrinsic=true does the trick for both shared and 
exclusive locks and brings the latency down to <2us per operation. I hope that 
the info key will make it into the next version of the standard, I 
certainly have use for it :)


Cheers,
Joseph

On 11/6/18 12:13 PM, Nathan Hjelm via users wrote:
[...]

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-06 Thread Nathan Hjelm via users



All of this is completely expected. Due to the requirements of the standard it 
is difficult to make use of network atomics even for MPI_Compare_and_swap 
(MPI_Accumulate and MPI_Get_accumulate spoil the party). If you want 
MPI_Fetch_and_op to be fast, set this MCA parameter:

osc_rdma_acc_single_intrinsic=true

A shared lock is slower than an exclusive lock because there is an extra lock 
step as part of the accumulate (it isn't needed if there is an exclusive lock). 
When setting the above parameter you are telling the implementation that you 
will only ever use a single count, and we can optimize that with the hardware. 
The RMA working group is working on an info key that will essentially do the 
same thing.
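
The parameter can be set either in the environment or on the mpirun command 
line, as seen elsewhere in this thread (the application name here is a 
placeholder):

```
$ export OMPI_MCA_osc_rdma_acc_single_intrinsic=true
$ mpirun -n 2 ./rma_benchmark

# equivalently, per invocation:
$ mpirun --mca osc_rdma_acc_single_intrinsic true -n 2 ./rma_benchmark
```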


Note the above parameter won't help you with IB if you are using UCX unless you 
set this (master only right now):

btl_uct_transports=dc_mlx5
btl=self,vader,uct
osc=^ucx




Though there may be a way to get osc/ucx to enable the same sort of 
optimization. I don't know.
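
Expressed as mpirun options, those settings would look something like this 
(a sketch: the same parameters, just passed via --mca; the application name 
is a placeholder):

```
$ mpirun --mca btl_uct_transports dc_mlx5 \
    --mca btl self,vader,uct \
    --mca osc ^ucx \
    -n 2 ./rma_benchmark
```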



-Nathan



On Nov 06, 2018, at 09:38 AM, Joseph Schuchart wrote:
[...]

[OMPI users] Latencies of atomic operations on high-performance networks

2018-11-06 Thread Joseph Schuchart

All,

I am currently experimenting with MPI atomic operations and wanted to 
share some interesting results I am observing. The numbers below are 
measurements from both an IB-based cluster and our Cray XC40. The 
benchmarks look like the following snippet:


```
  if (rank == 1) {
    uint64_t res, val;
    for (size_t i = 0; i < NUM_REPS; ++i) {
      MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
      MPI_Win_flush(target, win);
    }
  }
  MPI_Barrier(MPI_COMM_WORLD);
```

Only rank 1 performs atomic operations; rank 0 waits in a barrier (I 
have tried to confirm that the operations are done in hardware by 
letting rank 0 sleep for a while and ensuring that communication 
still progresses). Of particular interest for my use case is fetch_op, but I am 
including other operations here nevertheless:
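
For context, the full benchmark presumably wraps the snippet above in window 
setup, an access epoch, and timing. A hedged reconstruction (window 
allocation, lock type, and timing are assumptions; only the inner loop comes 
from the snippet):

```
#include <mpi.h>
#include <stdio.h>
#include <stdint.h>

#define NUM_REPS 100000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int target = 0;
    uint64_t *baseptr;
    MPI_Win win;
    MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &baseptr, &win);
    *baseptr = 0;
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 1) {
        /* MPI_LOCK_EXCLUSIVE vs MPI_LOCK_SHARED here is the "exclusive
         * lock" vs "shared lock" distinction in the results below */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
        uint64_t res, val = 1;
        double start = MPI_Wtime();
        for (size_t i = 0; i < NUM_REPS; ++i) {
            /* datatype varied between MPI_UINT32_T and MPI_UINT64_T
             * in the experiments */
            MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
            MPI_Win_flush(target, win);
        }
        printf("fetch_op: %fus\n", (MPI_Wtime() - start) / NUM_REPS * 1e6);
        MPI_Win_unlock(target, win);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```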


* Linux Cluster, IB QDR *
average of 10 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 4.323384us
compare_exchange: 2.035905us
accumulate: 4.326358us
get_accumulate: 4.334831us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.438080us
compare_exchange: 2.398836us
accumulate: 2.435378us
get_accumulate: 2.448347us

Shared lock, MPI_UINT32_T:
fetch_op: 6.819977us
compare_exchange: 4.551417us
accumulate: 6.807766us
get_accumulate: 6.817602us

Shared lock, MPI_UINT64_T:
fetch_op: 4.954860us
compare_exchange: 2.399373us
accumulate: 4.965702us
get_accumulate: 4.977876us

There are two interesting observations:
a) operations on 64-bit operands generally seem to have lower latencies 
than operations on 32-bit operands
b) using an exclusive lock leads to lower latencies

Overall, there is a factor of almost 3 between SharedLock+uint32_t and 
ExclusiveLock+uint64_t for fetch_and_op, accumulate, and get_accumulate 
(compare_exchange seems to be somewhat of an outlier).


* Cray XC40, Aries *
average of 10 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 2.011794us
compare_exchange: 1.740825us
accumulate: 1.795500us
get_accumulate: 1.985409us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.017172us
compare_exchange: 1.846202us
accumulate: 1.812578us
get_accumulate: 2.005541us

Shared lock, MPI_UINT32_T:
fetch_op: 5.380455us
compare_exchange: 5.164458us
accumulate: 5.230184us
get_accumulate: 5.399722us

Shared lock, MPI_UINT64_T:
fetch_op: 5.415230us
compare_exchange: 1.855840us
accumulate: 5.212632us
get_accumulate: 5.396110us


The difference between exclusive and shared lock is about the same as 
with IB, and the latencies for 32-bit vs 64-bit are roughly the same 
(except for compare_exchange, it seems).


So my question is: is this to be expected? Is the higher latency when 
using a shared lock caused by an internal lock being acquired because 
the hardware operations are not actually atomic?


I'd be grateful for any insight on this.

Cheers,
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de