Thanks a lot for your analysis. This seems consistent with what I can obtain by playing around with my different test cases.

It seems that munmap() does *not* unregister the memory chunk from the cache. I suppose this is the reason for the bug. In fact, using mmap(..., MAP_ANONYMOUS | MAP_PRIVATE) and munmap() as substitutes for malloc()/free() triggers the same problem. It looks to me like there is an oversight in the OPAL hooks around the memory functions, then. Do you agree?
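To make the substitution explicit, here is a minimal sketch of the kind of loop I have in mind. This is only an illustration, not my actual test case: the buffer size, tag and iteration count are arbitrary, and error checking is omitted.

/* Illustration only: mmap()/munmap() standing in for malloc()/free().
 * Each iteration maps a fresh anonymous buffer, uses it for a large
 * (rendezvous-sized) transfer, then unmaps it. The kernel is likely to
 * hand back the same virtual address on the next iteration, so if
 * munmap() did not evict the cached registration, iterations >= 1 can
 * end up using a stale mkey. */
#include <mpi.h>
#include <sys/mman.h>
#include <string.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    size_t sz = 1 << 20;        /* large enough to exceed the eager limit */
    for (int iter = 0; iter < 4; iter++) {
        char *buf = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        if (rank == 1) {
            memset(buf, iter + 1, sz);
            MPI_Send(buf, (int) sz, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        } else if (rank == 0) {
            MPI_Recv(buf, (int) sz, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("iteration %d, first byte = %d (expected %d)\n",
                   iter, buf[0], iter + 1);
        }
        munmap(buf, sz);        /* does this evict the registration from the cache? */
    }
    MPI_Finalize();
    return 0;
}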
E.

On Tue, Nov 11, 2014 at 3:17 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
> I was able to reproduce your issue and I think I understand the problem a bit better, at least. This demonstrates exactly what I was pointing to:
>
> It looks like when the test switches over from eager RDMA (I'll explain in a second) to a rendezvous protocol working entirely in user buffer space, things go bad.
>
> If your input is smaller than some threshold, the eager RDMA limit, then the contents of your user buffer are copied into OMPI/OpenIB BTL scratch buffers called "eager fragments". This pool of resources is preregistered and pinned, and its rkeys have already been exchanged. So, in the eager protocol, your data is copied into these "locked and loaded" RDMA frags and the put/get is handled internally. When the data is received, it's copied back out into your buffer. In your setup, this always works.
>
> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca btl_openib_eager_limit 512 -mca btl openib,self ./ibtest -s 56
> per-node buffer has size 448 bytes
> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
> node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
> node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
> node 0 iteration 3, lead word received from peer is 0x00001001 [ok]
>
> When you exceed the eager threshold, this always fails on the second iteration. To understand this, you need to know that there is a protocol switch, and your user buffer is now used for the transfer. Hence, the user buffer is registered with the HCA. This is an inherently high-latency operation and is one of the primary motives for doing copy-in/copy-out into preregistered buffers for small, latency-sensitive ops. For bandwidth-bound transfers, the cost to register can be amortized over the whole transfer, but it still affects the total bandwidth. In the case of a rendezvous protocol where the user buffer is registered, there is an optimization, mostly used to help improve the numbers in bandwidth benchmarks, called a registration cache. With registration caching, the user buffer is registered once, the mkey is put into a cache, and the memory is kept pinned until the system provides some notification, via either memory hooks in p2p malloc or ummunotify, that the buffer has been freed; this signals that the mkey can be evicted from the cache. On subsequent send/recv operations from the same user buffer address, the OpenIB BTL will find the address in the registration cache, take the cached mkey, avoid paying the cost of the memory registration, and start the data transfer.
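Just so we agree on terminology: my mental model of that registration cache is something like the sketch below. This is NOT the actual OMPI rcache/mpool code, only a conceptual illustration (the names reg_entry and cache_lookup are made up) of why a lookup keyed purely on virtual address returns a stale mkey when munmap()/mmap() recycles an address without the cache being notified.

/* Conceptual sketch of an address-keyed registration cache; NOT the real
 * OMPI code. A hit is decided purely on the virtual address range, so if
 * an address is munmap()ed and later remapped without the cache being
 * told, the stale entry (and its mkey) is happily returned. */
#include <stddef.h>
#include <stdint.h>

struct reg_entry {
    void    *base;              /* start of the registered range       */
    size_t   len;               /* length of the registered range      */
    uint32_t lkey, rkey;        /* keys obtained from ibv_reg_mr()     */
    struct reg_entry *next;
};

static struct reg_entry *cache_head;

/* Return a cached registration covering [addr, addr+len), if any. */
struct reg_entry *cache_lookup(void *addr, size_t len)
{
    for (struct reg_entry *e = cache_head; e != NULL; e = e->next) {
        if ((char *) addr >= (char *) e->base &&
            (char *) addr + len <= (char *) e->base + e->len)
            return e;           /* hit: registration cost skipped, even if
                                   the pages behind this address changed */
    }
    return NULL;                /* miss: register the buffer and insert  */
}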
> What I noticed is that when the rendezvous protocol kicks in, it always fails on the second iteration:
>
> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca btl_openib_eager_limit 128 -mca btl openib,self ./ibtest -s 56
> per-node buffer has size 448 bytes
> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
> node 0 iteration 1, lead word received from peer is 0x00000000 [NOK]
> --------------------------------------------------------------------------
>
> So, I suspected it has something to do with the way the virtual address is being handled in this case. To test that theory, I just completely disabled the registration cache by setting -mca mpi_leave_pinned 0, and things start to work:
>
> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca btl_openib_eager_limit 128 -mca mpi_leave_pinned 0 -mca btl openib,self ./ibtest -s 56
> per-node buffer has size 448 bytes
> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
> node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
> node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
> node 0 iteration 3, lead word received from peer is 0x00001001 [ok]
>
> I don't know enough about memory hooks or the registration cache implementation to speak with any authority, but it looks like this is where the issue resides. As a workaround, can you try your original experiment with -mca mpi_leave_pinned 0 and see if you get consistent results?
>
> Josh
>
> On Tue, Nov 11, 2014 at 7:07 AM, Emmanuel Thomé <emmanuel.th...@gmail.com> wrote:
>>
>> Hi again,
>>
>> I've been able to simplify my test case significantly. It now runs with 2 nodes, and only a single MPI_Send / MPI_Recv pair is used.
>>
>> The pattern is as follows:
>>
>>  * - ranks 0 and 1 both own a local buffer.
>>  * - each fills it with (deterministically known) data.
>>  * - rank 0 collects the data from rank 1's local buffer
>>  *   (whose contents should be no mystery), and writes this to a
>>  *   file-backed mmaped area.
>>  * - rank 0 compares what it receives with what it knows it *should
>>  *   have* received.
>>
>> The test fails if:
>>
>>  * - the openib btl is used among the 2 nodes,
>>  * - a file-backed mmaped area is used for receiving the data,
>>  * - the write is done to a newly created file,
>>  * - the per-node buffer is large enough.
>>
>> For a per-node buffer size above 12kb (12240 bytes to be exact), my program fails, since the MPI_Recv does not receive the correct data chunk (it just gets zeroes).
>>
>> I attach the simplified test case. I hope someone will be able to reproduce the problem.
>>
>> Best regards,
>>
>> E.
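To spare readers the attachment, the heart of the pattern described in the quoted message is roughly the following. This is only a sketch, not the attached ibtest program itself: the file name, the data pattern and the buffer size are illustrative, and error checking is omitted.

/* Sketch of the simplified pattern: rank 1 fills a local buffer, rank 0
 * receives it into a file-backed mmap()ed area (newly created file) and
 * inspects the contents. Illustrative only. */
#include <mpi.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    size_t sz = 16384;                       /* above the ~12kb failure threshold */

    if (rank == 1) {
        uint32_t *src = malloc(sz);          /* deterministically known data */
        for (size_t i = 0; i < sz / sizeof(uint32_t); i++)
            src[i] = (uint32_t) i | 0x401;
        MPI_Send(src, (int) sz, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        free(src);
    } else if (rank == 0) {
        int fd = open("out.bin", O_CREAT | O_TRUNC | O_RDWR, 0644);
        ftruncate(fd, (off_t) sz);           /* newly created, hole-backed file */
        uint32_t *area = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
        MPI_Recv(area, (int) sz, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("lead word received from peer is 0x%08x\n", (unsigned) area[0]);
        munmap(area, sz);
        close(fd);
    }
    MPI_Finalize();
    return 0;
}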
>>
>> On Mon, Nov 10, 2014 at 5:48 PM, Emmanuel Thomé <emmanuel.th...@gmail.com> wrote:
>> > Thanks for your answer.
>> >
>> > On Mon, Nov 10, 2014 at 4:31 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>> >> Just really quick off the top of my head, mmaping relies on the virtual memory subsystem, whereas IB RDMA operations rely on physical memory being pinned (unswappable.)
>> >
>> > Yes. Does that mean that the result of computations should be undefined if I happen to give a user buffer which corresponds to a file? That would be surprising.
>> >
>> >> For a large message transfer, the OpenIB BTL will register the user buffer, which will pin the pages and make them unswappable.
>> >
>> > Yes. But what are the semantics of pinning the VM area pointed to by ptr if ptr happens to be mmaped from a file?
>> >
>> >> If the data being transferred is small, you'll copy-in/out to internal bounce buffers and you shouldn't have issues.
>> >
>> > Are you saying that the openib layer does have provision in this case for letting the RDMA happen with a pinned physical memory range, and for later performing the copy to the file-backed mmaped range? That would make perfect sense indeed, although I don't have enough familiarity with the OMPI code to see where it happens, and more importantly whether the completion properly waits for this post-RDMA copy to complete.
>> >
>> >> 1. If you try to just bcast a few kilobytes of data using this technique, do you run into issues?
>> >
>> > No. All "simpler" attempts were successful, unfortunately. Can you be a little bit more precise about what scenario you imagine? The setting "all ranks mmap a local file, and rank 0 broadcasts there" is successful.
>> >
>> >> 2. How large is the data in the collective (input and output), is in_place used? I'm guessing it's large enough that the BTL tries to work with the user buffer.
>> >
>> > MPI_IN_PLACE is used in reduce_scatter and allgather in the code. Collectives are with communicators of 2 nodes, and we're talking (for the smallest failing run) 8kb per node (i.e. 16kb total for an allgather).
>> >
>> > E.
>> >
>> >> On Mon, Nov 10, 2014 at 9:29 AM, Emmanuel Thomé <emmanuel.th...@gmail.com> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> I'm stumbling on a problem related to the openib btl in openmpi-1.[78].*, and the (I think legitimate) use of file-backed mmaped areas for receiving data through MPI collective calls.
>> >>>
>> >>> A test case is attached. I've tried to make it reasonably small, although I recognize that it's not extra thin. The test case is a trimmed-down version of what I witness in the context of a rather large program, so there is no claim of relevance of the test case itself. It's here just to trigger the desired misbehaviour. The test case contains some detailed information on what is done, and the experiments I did.
>> >>>
>> >>> In a nutshell, the problem is as follows.
>> >>>
>> >>>  - I do a computation, which involves MPI_Reduce_scatter and MPI_Allgather.
>> >>>  - I save the result to a file (collective operation).
>> >>>
>> >>> *If* I save the file using something such as:
>> >>>    fd = open("blah", ...
>> >>>    area = mmap(..., fd, )
>> >>>    MPI_Gather(..., area, ...)
>> >>> *AND* the MPI_Reduce_scatter is done with an alternative implementation (which I believe is correct),
>> >>> *AND* communication is done through the openib btl,
>> >>>
>> >>> then the file which gets saved is inconsistent with what is obtained with the normal MPI_Reduce_scatter (although memory areas do coincide before the save).
>> >>>
>> >>> I tried to dig a bit in the openib internals, but all I've been able to witness was beyond my expertise (an RDMA read not transferring the expected data, but I'm too uncomfortable with this layer to say anything I'm sure about).
>> >>>
>> >>> Tests have been done with several openmpi versions including 1.8.3, on a debian wheezy (7.5) + OFED 2.3 cluster.
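For completeness, the save step sketched in the quoted report above (fd = open(...); area = mmap(..., fd, ...); MPI_Gather(..., area, ...)), fleshed out a bit, looks roughly like the following fragment. This is illustrative only, not the real program: the function name save_result and its parameters are made up, MPI_BYTE is a placeholder datatype, and error checking is omitted; only the file name "blah" comes from the report.

/* Illustrative fragment: rank 0 gathers every rank's chunk directly into
 * a newly created, file-backed MAP_SHARED mapping. In the real program the
 * chunks come out of an MPI_Reduce_scatter / MPI_Allgather computation. */
#include <mpi.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stddef.h>

void save_result(void *local_chunk, size_t chunk_bytes, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    void *area = NULL;
    int fd = -1;
    if (rank == 0) {
        fd = open("blah", O_CREAT | O_TRUNC | O_RDWR, 0644);
        ftruncate(fd, (off_t) (chunk_bytes * size));
        area = mmap(NULL, chunk_bytes * size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
    }
    /* the receive buffer on rank 0 is the file-backed mapping itself */
    MPI_Gather(local_chunk, (int) chunk_bytes, MPI_BYTE,
               area, (int) chunk_bytes, MPI_BYTE, 0, comm);
    if (rank == 0) {
        munmap(area, chunk_bytes * size);
        close(fd);
    }
}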
>> >>>
>> >>> It would be great if someone could tell me if he is able to reproduce the bug, or tell me whether something which is done in this test case is illegal in any respect. I'd be glad to provide further information which could be of any help.
>> >>>
>> >>> Best regards,
>> >>>
>> >>> E. Thomé.