Well, that is embarrassing! Thank you so much for figuring this out and providing a detailed answer (also thanks to everyone else who tried to reproduce it). I guess I assumed some synchronization in lock_all even though I know that it is not collective. With an additional barrier between initialization and accumulate, things work smoothly in our original application.

Best
Joseph


On 03/09/2017 03:10 PM, Steffen Christgau wrote:
Hi Joseph,

In your code, you are updating the local buffer, which is also exposed
via the window, right after the lock_all call, but the stores
(baseptr[i] = 1000 + loffs++, let's call those the buffer
initialization) may overwrite the outcome of other concurrent
operations, i.e. the accumulate calls in your case.

Another process that has already advanced to the accumulate loop may
change data in the local window while your local process has not yet
completed the initialization. Thus, in case of process skew, the
initialization overwrites the outcome of those accumulates and their
contributions are lost.

I provoked process skew by adding a

if (comm_rank == 0) {
    sleep(1);
}

before the initialization loop, which enables me to reproduce the wrong
results with GCC 6.3 and OpenMPI 2.0.2 when executing the program with
two MPI processes.
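
For reference, this is roughly the pattern I used; note that it is my own
reconstruction and not your attached MWE, so the element count, the
initialization values, and the expected-value computation are made up for
illustration:

```
/* repro_sketch.c -- my reconstruction of the failing pattern (not the
 * attached MWE). Build with: mpicc repro_sketch.c -o repro_sketch */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

#define NELEM 5

int main(int argc, char **argv)
{
    int rank, size, *baseptr;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* single node, so MPI_COMM_WORLD can serve as the shared-memory comm */
    MPI_Win_allocate_shared(NELEM * sizeof(int), sizeof(int), MPI_INFO_NULL,
                            MPI_COMM_WORLD, &baseptr, &win);
    MPI_Win_lock_all(0, win);  /* opens the epoch, does NOT synchronize ranks */

    if (rank == 0)
        sleep(1);              /* provoke process skew: rank 0 initializes late */

    /* buffer initialization: plain stores into the exposed local memory.
     * Faster ranks may already be accumulating into this window, and the
     * stores below silently overwrite their contributions. */
    for (int i = 0; i < NELEM; i++)
        baseptr[i] = 1000 + i;

    /* every rank adds 1 to every element of every rank's window */
    int one = 1;               /* origin buffer, kept valid until the flush */
    for (int target = 0; target < size; target++)
        for (int i = 0; i < NELEM; i++)
            MPI_Accumulate(&one, 1, MPI_INT, target, i, 1, MPI_INT,
                           MPI_SUM, win);

    MPI_Win_flush_all(win);       /* complete my own accumulates everywhere */
    MPI_Barrier(MPI_COMM_WORLD);  /* wait until every rank has flushed */
    MPI_Win_sync(win);            /* unified model: sync private/public copy */

    for (int i = 0; i < NELEM; i++)
        printf("[%d] baseptr[%d]: %d (expected %d)\n",
               rank, i, baseptr[i], 1000 + i + size);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```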

The lock_all call gives you no collective synchronization in the
window's communicator (as hinted at on p. 446 of the MPI 3.1 standard).
That is, other processes may have already performed their accumulate
phase while the local one is still (or not yet) in the initialization
and overwrites their data (see above).

You might consider an EXCLUSIVE lock around your initialization, but
this won't solve the issue, because any other process may do its
accumulate phase after the window creation but before you enter the
buffer initialization loop.

As far as I understand your MWE code, the initialization on all
processes must complete before the accumulate loop starts in order to
get correct results. I suspect an MPI_Barrier is missing before the
accumulate loop. Since you are using the unified model, you can omit
the proposed exclusive lock (see above) as well.
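
In terms of the reconstruction above, the barrier would sit between the
two loops (again just a sketch with my placeholder names, not your
actual code):

```
    /* buffer initialization */
    for (int i = 0; i < NELEM; i++)
        baseptr[i] = 1000 + i;

    MPI_Barrier(MPI_COMM_WORLD);   /* all ranks finish initializing ...        */

    int one = 1;                   /* ... before any rank starts to accumulate */
    for (int target = 0; target < size; target++)
        for (int i = 0; i < NELEM; i++)
            MPI_Accumulate(&one, 1, MPI_INT, target, i, 1, MPI_INT,
                           MPI_SUM, win);
```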

Hope this helps.

Regards, Steffen

On 03/01/2017 04:03 PM, Joseph Schuchart wrote:
Hi all,

We are seeing issues in one of our applications, in which processes in a
shared communicator allocate a shared MPI window and execute
MPI_Accumulate simultaneously on it to iteratively update each process'
values. The test boils down to the sample code attached. Sample output
is as follows:

```
$ mpirun -n 4 ./mpi_shared_accumulate
[1] baseptr[0]: 1010 (expected 1010)
[1] baseptr[1]: 1011 (expected 1011)
[1] baseptr[2]: 1012 (expected 1012)
[1] baseptr[3]: 1013 (expected 1013)
[1] baseptr[4]: 1014 (expected 1014)
[2] baseptr[0]: 1005 (expected 1010) [!!!]
[2] baseptr[1]: 1006 (expected 1011) [!!!]
[2] baseptr[2]: 1007 (expected 1012) [!!!]
[2] baseptr[3]: 1008 (expected 1013) [!!!]
[2] baseptr[4]: 1009 (expected 1014) [!!!]
[3] baseptr[0]: 1010 (expected 1010)
[0] baseptr[0]: 1010 (expected 1010)
[0] baseptr[1]: 1011 (expected 1011)
[0] baseptr[2]: 1012 (expected 1012)
[0] baseptr[3]: 1013 (expected 1013)
[0] baseptr[4]: 1014 (expected 1014)
[3] baseptr[1]: 1011 (expected 1011)
[3] baseptr[2]: 1012 (expected 1012)
[3] baseptr[3]: 1013 (expected 1013)
[3] baseptr[4]: 1014 (expected 1014)
```

Each process should hold the same values, but sometimes (not on all
executions) random processes diverge (marked with [!!!]).

I made the following observations:

1) The issue occurs with both OpenMPI 1.10.6 and 2.0.2 but not with
MPICH 3.2.
2) The issue occurs only if the window is allocated through
MPI_Win_allocate_shared; using MPI_Win_allocate works fine.
3) The code assumes that MPI_Accumulate atomically updates individual
elements (please correct me if that is not covered by the MPI
standard); a simplified example of the call I mean follows below.
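
For illustration only (this is a simplified stand-in, not the exact call
from the attached code), the kind of per-element update I mean is:

```
/* simplified stand-in: one element of a predefined type is updated per
 * call, and the MPI_SUM reduction is assumed to be applied atomically
 * per element with respect to other concurrent accumulates */
int val = 1;
MPI_Accumulate(&val, 1, MPI_INT,  /* origin: a single int                  */
               target,            /* hypothetical target rank variable     */
               i,                 /* displacement: i-th int in the window  */
               1, MPI_INT, MPI_SUM, win);
```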

Both OpenMPI and the example code were compiled using GCC 5.4.1 and run
on a Linux system (single node). OpenMPI was configured with
--enable-mpi-thread-multiple and --with-threads, but the application is
not multi-threaded. Please let me know if you need any other information.

Cheers
Joseph



--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
