Re: [OMPI users] Coordinating (non-overlapping) local stores with remote puts when using passive RMA synchronization

2020-06-02 Thread Joseph Schuchart via users

Hi Stephen,

Let me try to answer your questions inline (I don't have extensive 
experience with the separate model; in my experience, most 
implementations support the unified model, with some exceptions):


On 5/31/20 1:31 AM, Stephen Guzik via users wrote:

Hi,

I'm trying to get a better understanding of coordinating 
(non-overlapping) local stores with remote puts when using passive 
synchronization for RMA.  I understand that the window should be locked 
for a local store, but can it be a shared lock?


Yes. There is no reason why that cannot be a shared lock.

In my example, each 
process retrieves and increments an index (indexBuf and indexWin) from a 
target process and then stores its rank into an array (dataBuf and 
dataWin) at that index on the target.  If the target is local, a local 
store is attempted:


/* indexWin on indexBuf, dataWin on dataBuf */
std::vector<int> myvals(numProc);
const int one = 1;   /* increment added to the remote index (variable name assumed) */
MPI_Win_lock_all(0, indexWin);
MPI_Win_lock_all(0, dataWin);
for (int tgtProc = 0; tgtProc != numProc; ++tgtProc)
  {
    /* Atomically fetch the current index from the target and add one to it */
    MPI_Fetch_and_op(&one, &myvals[tgtProc], MPI_INT, tgtProc, 0,
                     MPI_SUM, indexWin);
    MPI_Win_flush_local(tgtProc, indexWin);
    // Put our rank into the right location of the target
    if (tgtProc == procID)
      {
        dataBuf[myvals[procID]] = procID;   /* local store */
      }
    else
      {
        MPI_Put(&procID, 1, MPI_INT, tgtProc, myvals[tgtProc], 1,
                MPI_INT, dataWin);
      }
  }
MPI_Win_flush_all(dataWin);  /* Force completion and time synchronization */

MPI_Barrier(MPI_COMM_WORLD);
/* Proceed with local loads and unlock windows later */

I believe this is valid for a unified memory model but would probably 
fail for a separate model (unless a separate model very cleverly merges 
a private and public window?).  Is this understanding correct?  And if I 
instead use MPI_Put for the local write, then it should be valid for 
both memory models?


Yes, if you use RMA operations even on local memory, the code is valid 
for both memory models.
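
For illustration, the local branch of your first snippet could issue a 
put to the calling rank itself instead of a plain store. A sketch using 
the variable names from your code:

/* Self-targeted put instead of a local store: correct in both the
 * unified and the separate memory model */
MPI_Put(&procID, 1, MPI_INT, procID /* target = self */,
        myvals[procID], 1, MPI_INT, dataWin);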


The MPI standard on page 455 (S3) states that "a store to process memory 
to a location in a window must not start once a put or accumulate update 
to that target window has started, until the put or accumulate update 
becomes visible in process memory." So there is no clever merging and it 
is up to the user to ensure that there are no puts and stores happening 
at the same time.




Another approach is specific locks.  I don't like this because it seems 
there are excessive synchronizations.  But if I really want to mix local 
stores and remote puts, is this the only way using locks?


/* indexWin on indexBuf, dataWin on dataBuf */
std::vector myvals(numProc);
for (int tgtProc = 0; tgtProc != numProc; ++tgtProc)
   {
     MPI_Win_lock(MPI_LOCK_SHARED, tgtProc, 0, indexWin);
     MPI_Fetch_and_op(, [tgtProc], MPI_INT, tgtProc, 0, 
MPI_SUM,indexWin);

     MPI_Win_unlock(tgtProc, indexWin);
     // Put our rank into the right location of the target
     if (tgtProc == procID)
   {
     MPI_Win_lock(MPI_LOCK_EXCLUSIVE, tgtProc, 0, dataWin);
     dataBuf[myvals[procID]] = procID;
     MPI_Win_unlock(tgtProc, dataWin);  /*(A)*/
   }
     else
   {
     MPI_Win_lock(MPI_LOCK_SHARED, tgtProc, 0, dataWin);
     MPI_Put(, 1, MPI_INT, tgtProc, myvals[tgtProc], 1, 
MPI_INT,dataWin);

     MPI_Win_unlock(tgtProc, dataWin);
   }
   }
/* Proceed with local loads */

I believe this is also valid for both memory models?  An unlock must 
have followed the last access to the local window, before the exclusive 
lock is gained.  That should have synchronized the windows and another 
synchronization should happen at (A).  Is that understanding correct? 


That is correct for both memory models, yes. It is likely to be slower, 
though, because locking and unlocking involve some overhead. You are 
better off using a put instead.


If you really want to use local stores you can query the MPI_WIN_MODEL 
window attribute, use stores only if it reports MPI_WIN_UNIFIED, and 
fall back to puts for the separate model.
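
A sketch of that check, reusing the window and buffer names from your 
example (in MPI-3 the attribute value is a pointer to an int that is 
either MPI_WIN_UNIFIED or MPI_WIN_SEPARATE):

int *model;
int  flag;
MPI_Win_get_attr(dataWin, MPI_WIN_MODEL, &model, &flag);
const bool unified = flag && (*model == MPI_WIN_UNIFIED);

if (tgtProc == procID && unified)
  {
    dataBuf[myvals[procID]] = procID;   /* local store: unified model only */
  }
else
  {
    MPI_Put(&procID, 1, MPI_INT, tgtProc, myvals[tgtProc], 1,
            MPI_INT, dataWin);          /* valid under both models */
  }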


> If so, how does one ever get into a situation where MPI_Win_sync must 
> be used?


You can think of a synchronization scheme where each process takes a 
shared lock on a window, stores data to a local location, calls 
MPI_Win_sync, and then signals to the other processes that the data is 
now available, e.g., through a barrier or a send. In that case the 
processes keep the lock held and use some non-RMA synchronization 
instead.
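
A minimal sketch of such a scheme with two ranks, reusing dataWin and 
dataBuf from your example (the stored value and offsets are just 
illustrative):

MPI_Win_lock_all(0, dataWin);     /* shared locks, held for the whole phase */

if (procID == 0)
  {
    dataBuf[0] = 42;              /* plain store into the local window memory */
    MPI_Win_sync(dataWin);        /* make the store visible in the public copy */
  }

MPI_Barrier(MPI_COMM_WORLD);      /* non-RMA signal: the data is now available */

if (procID == 1)
  {
    int val;
    MPI_Get(&val, 1, MPI_INT, 0 /*rank*/, 0 /*disp*/, 1, MPI_INT, dataWin);
    MPI_Win_flush(0, dataWin);    /* complete the get before using val */
  }

MPI_Win_unlock_all(dataWin);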




Final question.  In the first example, let's say there is a lot of 
computation in the loop and I want the MPI_Puts to immediately make 
progress.  Would it be sensible to follow the MPI_Put with a 
MPI_Win_flush_local to get things moving?  Or is it best to avoid any 
unnecessary synchronizations?


That is highly implementation-specific. Some implementations may buffer 
the puts and delay the transfer until the flush, some may initiate it 
immediately, and some may treat a local flush similar to a regular 
flush. I would not make any assumptions about the underlying 
implementation.
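
For reference, the two flush variants give different guarantees; a 
sketch with the names from your example:

MPI_Put(&procID, 1, MPI_INT, tgtProc, myvals[tgtProc], 1, MPI_INT, dataWin);

/* Completes the put at the origin only: the local buffer may be reused,
 * but the data is not necessarily visible at the target yet. */
MPI_Win_flush_local(tgtProc, dataWin);

/* Completes the put at both the origin and the target. */
MPI_Win_flush(tgtProc, dataWin);

Whether issuing either of these inside the loop actually helps overlap 
communication with the computation depends on the implementation.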

[OMPI users] Coordinating (non-overlapping) local stores with remote puts when using passive RMA synchronization

2020-05-30 Thread Stephen Guzik via users

Hi,

I'm trying to get a better understanding of coordinating 
(non-overlapping) local stores with remote puts when using passive 
synchronization for RMA.  I understand that the window should be locked 
for a local store, but can it be a shared lock?  In my example, each 
process retrieves and increments an index (indexBuf and indexWin) from a 
target process and then stores its rank into an array (dataBuf and 
dataWin) at that index on the target.  If the target is local, a local 
store is attempted:


/* indexWin on indexBuf, dataWin on dataBuf */
std::vector<int> myvals(numProc);
const int one = 1;   /* increment added to the remote index (variable name assumed) */
MPI_Win_lock_all(0, indexWin);
MPI_Win_lock_all(0, dataWin);
for (int tgtProc = 0; tgtProc != numProc; ++tgtProc)
  {
    MPI_Fetch_and_op(&one, &myvals[tgtProc], MPI_INT, tgtProc, 0,
                     MPI_SUM, indexWin);
    MPI_Win_flush_local(tgtProc, indexWin);
    // Put our rank into the right location of the target
    if (tgtProc == procID)
      {
        dataBuf[myvals[procID]] = procID;
      }
    else
      {
        MPI_Put(&procID, 1, MPI_INT, tgtProc, myvals[tgtProc], 1,
                MPI_INT, dataWin);
      }
  }
MPI_Win_flush_all(dataWin);  /* Force completion and time synchronization */
MPI_Barrier(MPI_COMM_WORLD);
/* Proceed with local loads and unlock windows later */

I believe this is valid for a unified memory model but would probably 
fail for a separate model (unless a separate model very cleverly merges 
a private and public window?).  Is this understanding correct?  And if I 
instead use MPI_Put for the local write, then it should be valid for 
both memory models?


Another approach is specific locks.  I don't like this because it seems 
there are excessive synchronizations.  But if I really want to mix local 
stores and remote puts, is this the only way using locks?


/* indexWin on indexBuf, dataWin on dataBuf */
std::vector<int> myvals(numProc);
const int one = 1;   /* increment added to the remote index (variable name assumed) */
for (int tgtProc = 0; tgtProc != numProc; ++tgtProc)
  {
    MPI_Win_lock(MPI_LOCK_SHARED, tgtProc, 0, indexWin);
    MPI_Fetch_and_op(&one, &myvals[tgtProc], MPI_INT, tgtProc, 0,
                     MPI_SUM, indexWin);
    MPI_Win_unlock(tgtProc, indexWin);
    // Put our rank into the right location of the target
    if (tgtProc == procID)
      {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, tgtProc, 0, dataWin);
        dataBuf[myvals[procID]] = procID;
        MPI_Win_unlock(tgtProc, dataWin);  /*(A)*/
      }
    else
      {
        MPI_Win_lock(MPI_LOCK_SHARED, tgtProc, 0, dataWin);
        MPI_Put(&procID, 1, MPI_INT, tgtProc, myvals[tgtProc], 1,
                MPI_INT, dataWin);
        MPI_Win_unlock(tgtProc, dataWin);
      }
  }
/* Proceed with local loads */

I believe this is also valid for both memory models?  An unlock must 
have followed the last access to the local window, before the exclusive 
lock is gained.  That should have synchronized the windows and another 
synchronization should happen at (A).  Is that understanding correct?  
If so, how does one ever get into a situation where MPI_Win_sync must be 
used?


Final question.  In the first example, let's say there is a lot of 
computation in the loop and I want the MPI_Puts to immediately make 
progress.  Would it be sensible to follow the MPI_Put with a 
MPI_Win_flush_local to get things moving?  Or is it best to avoid any 
unnecessary synchronizations?


Thanks,
Stephen