Bruce,

Which version of Open MPI are you using? Out of curiosity, did you try your program with another MPI implementation such as MPICH or one of its derivatives?

When using derived datatypes (ddt) in one-sided communication, the ddt description must be sent along with the data. Two protocols are used internally:
- inline, for a "short" description
- within a separate message, for a "long" description

Assuming your program is correct, my guess is that there is a bug in the way the "long" ddt description is handled, and I will investigate that.
That being said, it is very likely that MPI_Type_create_struct invoked with a high count will internally generate a long description, so it will always be suboptimal compared to MPI_Type_create_subarray or other subroutines that can be used thanks to the "regular shape" of your ddt.

Cheers,

Gilles

On Saturday, April 30, 2016, Palmer, Bruce J <bruce.pal...@pnnl.gov> wrote:

> I’ve been trying to recreate the semantics of the Global Arrays gather and scatter operations using MPI RMA routines, and I’ve run into some issues with MPI datatypes. I’ve been focusing on building MPI versions of the GA gather and scatter calls, which I’ve been trying to implement using MPI datatypes built with the MPI_Type_create_struct call. I’ve developed a test program that simulates copying data into and out of a 1D distributed array of size NSIZE. Each processor contains a segment of approximately size NSIZE/nproc and is responsible for assigning every nproc-th value in the array, starting with the value indexed by the rank of the process. After assigning values and synchronizing the distributed data structure, each processor then reads the values set by the processor of next higher rank (the process with rank nproc-1 reads the values set by process 0).
>
> The distributed array is represented by an MPI window created with a standard MPI_Win_create call. The values in the array are set and read using MPI RMA operations, either MPI_Get/MPI_Put or MPI_Rget/MPI_Rput. Three different protocols have been used. The first is to call MPI_Win_lock to create a shared lock on the remote processor, then call MPI_Put/MPI_Get, and then call MPI_Win_unlock to clear the lock. The second protocol uses MPI request-based calls. After the call to MPI_Win_create, MPI_Win_lock_all is called to start a passive synchronization epoch on the window. Data is written to and read from the distributed array using MPI_Rput/MPI_Rget, immediately followed by a call to MPI_Wait using the handle returned by the MPI_Rput/MPI_Rget call. The third protocol also creates a passive synchronization epoch immediately after window creation, but uses calls to MPI_Put/MPI_Get immediately followed by a call to MPI_Win_flush_local. These three protocols seem to cover all the possibilities that I have seen in other MPI RMA-based implementations of ARMCI/GA.
>
> The issue I’ve run into is that these tests seem to work reliably if I build the datatype using the MPI_Type_create_subarray function, but fail for larger arrays (NSIZE ~ 10000) when I use MPI_Type_create_struct. Because the values being set by each processor are evenly spaced, I can use either function in this case (this is not generally true in applications). The struct datatype hangs on 2 processors using lock/unlock, crashes for the request-based protocol, and does not get the correct values in the Get phase of the data transfer when using flush_local. These tests are done on a Linux cluster with an InfiniBand interconnect, and the value of NSIZE is 10000. For comparison, the same test using MPI_Type_create_subarray seems to function reliably for all three protocols with NSIZE=1000000 using 1, 2, or 8 processors on 1 and 2 SMP nodes.
>
> I’ve attached the test program for these test cases. Does anyone have a suggestion about what might be going on here?
>
> Bruce
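
For reference, below is a minimal sketch (in C) of the two constructions Gilles compares above, for the strided access pattern in Bruce's test. The names nblocks, nproc and first are illustrative, not taken from the attached program; the point is only that the struct version carries one (blocklength, displacement, type) triple per element, while the subarray version is described by a handful of integers regardless of NSIZE.

    /* Sketch: two ways to describe the same selection of nblocks doubles,
     * one every nproc elements, starting at element `first` of the target
     * segment.  Names and sizes are illustrative only. */
    #include <mpi.h>
    #include <stdlib.h>

    /* Element-by-element description: one (blocklength, displacement, type)
     * triple per element, so the datatype description that has to be shipped
     * to the target grows linearly with nblocks. */
    MPI_Datatype strided_via_struct(int nblocks, int nproc, int first)
    {
        int          *blens = malloc(nblocks * sizeof(int));
        MPI_Aint     *disps = malloc(nblocks * sizeof(MPI_Aint));
        MPI_Datatype *types = malloc(nblocks * sizeof(MPI_Datatype));
        MPI_Datatype  newtype;

        for (int i = 0; i < nblocks; i++) {
            blens[i] = 1;
            disps[i] = ((MPI_Aint)i * nproc + first) * (MPI_Aint)sizeof(double);
            types[i] = MPI_DOUBLE;
        }
        MPI_Type_create_struct(nblocks, blens, disps, types, &newtype);
        MPI_Type_commit(&newtype);
        free(blens); free(disps); free(types);
        return newtype;
    }

    /* Same selection, but the target segment is viewed as an nblocks x nproc
     * array and one column is taken: the description stays a few integers
     * no matter how large nblocks gets. */
    MPI_Datatype strided_via_subarray(int nblocks, int nproc, int first)
    {
        int sizes[2]    = { nblocks, nproc };
        int subsizes[2] = { nblocks, 1 };
        int starts[2]   = { 0, first };
        MPI_Datatype newtype;

        MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C,
                                 MPI_DOUBLE, &newtype);
        MPI_Type_commit(&newtype);
        return newtype;
    }

An MPI_Type_vector with blocklength 1 and stride nproc, combined with a suitable target displacement in the Put/Get, would describe the same pattern just as compactly.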
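
And for readers following the thread, here is a rough sketch of the three completion protocols Bruce describes, each applied to a single put of one such derived datatype. The buffer, window, datatype and zero target displacement are placeholders and error checking is omitted; this is an illustration of the calls involved, not a substitute for the attached test program.

    /* Sketch: the three completion protocols from the test description,
     * each transferring nblocks doubles from a contiguous origin buffer
     * to a strided target datatype (e.g. one built as above). */
    #include <mpi.h>

    /* Protocol 1: shared lock / put / unlock around each operation. */
    void put_lock_unlock(const double *buf, int nblocks, int target,
                         MPI_Datatype target_type, MPI_Win win)
    {
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
        MPI_Put(buf, nblocks, MPI_DOUBLE,          /* contiguous origin */
                target, 0, 1, target_type, win);   /* strided target    */
        MPI_Win_unlock(target, win);   /* completes the put locally and remotely */
    }

    /* Protocol 2: request-based.  Assumes MPI_Win_lock_all(0, win) was
     * called once right after MPI_Win_create.  MPI_Wait completes the
     * operation locally; remote completion of a put still needs a flush,
     * an unlock, or the synchronization step of the test. */
    void put_request_based(const double *buf, int nblocks, int target,
                           MPI_Datatype target_type, MPI_Win win)
    {
        MPI_Request req;
        MPI_Rput(buf, nblocks, MPI_DOUBLE,
                 target, 0, 1, target_type, win, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    /* Protocol 3: flush-based.  Also assumes a single MPI_Win_lock_all
     * after window creation; MPI_Win_flush_local only guarantees local
     * completion, like MPI_Wait above. */
    void put_flush_local(const double *buf, int nblocks, int target,
                         MPI_Datatype target_type, MPI_Win win)
    {
        MPI_Put(buf, nblocks, MPI_DOUBLE,
                target, 0, 1, target_type, win);
        MPI_Win_flush_local(target, win);
    }

The Get side is symmetric with MPI_Get/MPI_Rget; there, MPI_Wait or MPI_Win_flush_local does guarantee that the data has arrived in the origin buffer. In all three cases the target datatype is the derived datatype under discussion, so all three exercise the datatype-description transfer Gilles mentions.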