Thanks a lot for your analysis. This seems consistent with what I can
obtain by playing around with my different test cases.

It seems that munmap() does *not* unregister the memory chunk from the
cache. I suppose this is the reason for the bug.

In fact, using mmap(..., MAP_ANONYMOUS | MAP_PRIVATE) and munmap() as
substitutes for malloc()/free() triggers the same problem.
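
For reference, here is roughly what I mean by that (a minimal sketch, not
the exact code from my test; error checking omitted):

  #include <stddef.h>
  #include <sys/mman.h>

  /* malloc()/free() stand-ins based on anonymous mmap().  Swapping these in
   * for the communication buffer reproduces the failure, which is what makes
   * me think munmap() is not intercepted.  The caller has to remember the
   * size for the free. */
  static void * anon_alloc(size_t size)
  {
      void * p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
      return p == MAP_FAILED ? NULL : p;
  }

  static void anon_free(void * p, size_t size)
  {
      munmap(p, size);
  }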

It looks to me, then, as though there is an oversight in the OPAL hooks
around the memory functions. Do you agree?

E.

On Tue, Nov 11, 2014 at 3:17 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
> I was able to reproduce your issue and I think I understand the problem a
> bit better at least. This demonstrates exactly what I was pointing to:
>
> It looks like things go bad when the test switches over from eager RDMA
> (I'll explain in a second) to a rendezvous protocol working entirely in
> user buffer space.
>
> If your input is smaller than some threshold, the eager RDMA limit, then
> the contents of your user buffer are copied into OMPI/OpenIB BTL scratch
> buffers called "eager fragments". This pool of resources is preregistered
> and pinned, and its rkeys have already been exchanged. So, in the eager
> protocol, your data is copied into these "locked and loaded" RDMA frags and
> the put/get is handled internally. When the data is received, it's copied
> back out into your buffer. In your setup, this always works:
>
> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca
> btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca
> btl_openib_eager_limit 512 -mca btl openib,self ./ibtest -s 56
> per-node buffer has size 448 bytes
> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
> node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
> node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
> node 0 iteration 3, lead word received from peer is 0x00001001 [ok]
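>
> Schematically, the eager path boils down to something like this
> (illustrative toy code with made-up names, not the actual BTL source):
>
>   #include <string.h>
>
>   #define EAGER_LIMIT 512   /* the btl_openib_eager_limit used above */
>
>   /* Preregistered and pinned at startup; rkeys already exchanged. */
>   struct eager_frag { char data[EAGER_LIMIT]; };
>
>   /* Eager send of a small message (n is at most EAGER_LIMIT). */
>   static void eager_send(struct eager_frag * frag,
>                          const void * user_buf, size_t n)
>   {
>       memcpy(frag->data, user_buf, n);   /* copy-in on the sender */
>       /* the fragment is RDMA'd to the peer's matching fragment, and the
>        * receiver copies it back out into the user buffer */
>   }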
>
> When you exceed the eager threshold, this always fails on the second
> iteration. To understand this, you need to understand that there is a
> protocol switch, after which your user buffer itself is used for the
> transfer. Hence, the user buffer is registered with the HCA. Registration
> is an inherently high-latency operation, and it is one of the primary
> motives for doing copy-in/copy-out into preregistered buffers for small,
> latency-sensitive ops. For bandwidth-bound transfers, the cost of
> registration can be amortized over the whole transfer, but it still
> affects the total bandwidth.
> In the case of a rendezvous protocol where the user buffer is registered,
> there is an optimization called a registration cache, mostly used to help
> improve the numbers in bandwidth benchmarks. With registration caching,
> the user buffer is registered once, the mkey is put into a cache, and the
> memory is kept pinned until the system provides some notification, via
> either the memory hooks in ptmalloc2 or ummunotify, that the buffer has
> been freed; this signals that the mkey can be evicted from the cache. On
> subsequent send/recv operations from the same user buffer address, the
> OpenIB BTL will find the address in the registration cache, take the
> cached mkey, avoid paying the cost of the memory registration, and start
> the data transfer.
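>
> To make the failure mode concrete, here is a toy model of such a cache
> (all names and data structures are made up for illustration; this is not
> the actual OpenIB BTL code):
>
>   #include <stddef.h>
>   #include <stdint.h>
>
>   struct reg_entry { void * base; size_t len; uint32_t mkey; int valid; };
>
>   #define CACHE_SLOTS 64
>   static struct reg_entry cache[CACHE_SLOTS];
>
>   /* Stand-in for the real registration call: expensive, pins the pages. */
>   static uint32_t hca_register(void * buf, size_t len)
>   {
>       (void) buf; (void) len;
>       return 42;                            /* dummy mkey */
>   }
>
>   /* On a large send/recv, look for a cached registration of the buffer. */
>   static uint32_t lookup_or_register(void * buf, size_t len)
>   {
>       for (int i = 0; i < CACHE_SLOTS; i++)
>           if (cache[i].valid && cache[i].base == buf && cache[i].len >= len)
>               return cache[i].mkey;         /* hit: registration cost avoided */
>       uint32_t mkey = hca_register(buf, len);
>       for (int i = 0; i < CACHE_SLOTS; i++)
>           if (!cache[i].valid) {
>               cache[i] = (struct reg_entry){ buf, len, mkey, 1 };
>               break;
>           }
>       return mkey;
>   }
>
>   /* This is what the free/unmap notification is supposed to trigger.  If
>    * an unmap is never reported, a later mapping that lands at the same
>    * virtual address silently reuses a stale mkey that still refers to the
>    * old physical pages. */
>   static void evict(void * base)
>   {
>       for (int i = 0; i < CACHE_SLOTS; i++)
>           if (cache[i].valid && cache[i].base == base)
>               cache[i].valid = 0;
>   }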
>
> What I noticed is that when the rendezvous protocol kicks in, it always
> fails on the second iteration.
>
> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca
> btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca
> btl_openib_eager_limit 128 -mca btl openib,self ./ibtest -s 56
> per-node buffer has size 448 bytes
> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
> node 0 iteration 1, lead word received from peer is 0x00000000 [NOK]
> --------------------------------------------------------------------------
>
> So, I suspected it had something to do with the way the virtual address is
> being handled in this case. To test that theory, I completely disabled the
> registration cache by setting -mca mpi_leave_pinned 0, and things started
> to work:
>
> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca
> btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca
> btl_openib_eager_limit 128 -mca mpi_leave_pinned 0 -mca btl openib,self
> ./ibtest -s 56
> per-node buffer has size 448 bytes
> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
> node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
> node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
> node 0 iteration 3, lead word received from peer is 0x00001001 [ok]
>
> I don't know enough about memory hooks or the registration cache
> implementation to speak with any authority, but it looks like this is where
> the issue resides. As a workaround, can you try your original experiment
> with -mca mpi_leave_pinned 0 and see if you get consistent results?
>
>
> Josh
>
>
>
>
>
> On Tue, Nov 11, 2014 at 7:07 AM, Emmanuel Thomé <emmanuel.th...@gmail.com>
> wrote:
>>
>> Hi again,
>>
>> I've been able to simplify my test case significantly. It now runs
>> with 2 nodes, and only a single MPI_Send / MPI_Recv pair is used.
>>
>> The pattern is as follows.
>>
>>  *  - ranks 0 and 1 both own a local buffer.
>>  *  - each fills it with (deterministically known) data.
>>  *  - rank 0 collects the data from rank 1's local buffer
>>  *    (whose contents should be no mystery), and writes this to a
>>  *    file-backed mmaped area.
>>  *  - rank 0 compares what it receives with what it knows it *should
>>  *  have* received.
>>
>> The test fails if:
>>
>>  *  - the openib btl is used between the 2 nodes.
>>  *  - a file-backed mmaped area is used for receiving the data.
>>  *  - the write is done to a newly created file.
>>  *  - the per-node buffer is large enough.
>>
>> For a per-node buffer size above 12kb (12240 bytes to be exact), my
>> program fails, since the MPI_Recv does not receive the correct data
>> chunk (it just gets zeroes).
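>>
>> For concreteness, the core of the pattern is roughly the following (my own
>> stripped-down sketch: sizes, flags and file names are illustrative, and the
>> attached program remains the authoritative version):
>>
>>   #include <fcntl.h>
>>   #include <mpi.h>
>>   #include <sys/mman.h>
>>   #include <unistd.h>
>>
>>   int main(int argc, char **argv)
>>   {
>>       enum { N = 16384 };                    /* above the ~12kb threshold */
>>       int rank;
>>       char local[N];
>>       MPI_Init(&argc, &argv);
>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>       for (int i = 0; i < N; i++)
>>           local[i] = (char) (i ^ rank);      /* deterministically known data */
>>       if (rank == 1) {
>>           MPI_Send(local, N, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
>>       } else if (rank == 0) {
>>           int fd = open("out.bin", O_RDWR | O_CREAT | O_TRUNC, 0600);
>>           ftruncate(fd, N);
>>           char * area = mmap(NULL, N, PROT_READ | PROT_WRITE,
>>                              MAP_SHARED, fd, 0);
>>           MPI_Recv(area, N, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
>>                    MPI_STATUS_IGNORE);
>>           /* area[] should now match what rank 1 put in local[]; with the
>>            * openib btl and a freshly created file it comes back as zeroes */
>>           munmap(area, N);
>>           close(fd);
>>       }
>>       MPI_Finalize();
>>       return 0;
>>   }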
>>
>> I attach the simplified test case. I hope someone will be able to
>> reproduce the problem.
>>
>> Best regards,
>>
>> E.
>>
>>
>> On Mon, Nov 10, 2014 at 5:48 PM, Emmanuel Thomé
>> <emmanuel.th...@gmail.com> wrote:
>> > Thanks for your answer.
>> >
>> > On Mon, Nov 10, 2014 at 4:31 PM, Joshua Ladd <jladd.m...@gmail.com>
>> > wrote:
>> >> Just really quick, off the top of my head: mmaping relies on the virtual
>> >> memory subsystem, whereas IB RDMA operations rely on physical memory
>> >> being pinned (unswappable).
>> >
>> > Yes. Does that mean that the result of computations should be
>> > undefined if I happen to give a user buffer which corresponds to a
>> > file? That would be surprising.
>> >
>> >> For a large message transfer, the OpenIB BTL will
>> >> register the user buffer, which will pin the pages and make them
>> >> unswappable.
>> >
>> > Yes. But what are the semantics of pinning the VM area pointed to by
>> > ptr if ptr happens to be mmaped from a file?
>> >
>> >> If the data being transferred is small, you'll copy-in/out to
>> >> internal bounce buffers and you shouldn't have issues.
>> >
>> > Are you saying that the openib layer does have provision in this case
>> > for letting the RDMA happen with a pinned physical memory range, and for
>> > later performing the copy to the file-backed mmaped range? That would
>> > make perfect sense indeed, although I don't have enough familiarity
>> > with the OMPI code to see where it happens, and more importantly
>> > whether the completion properly waits for this post-RDMA copy to
>> > complete.
>> >
>> >
>> >> 1. If you try to just bcast a few kilobytes of data using this
>> >> technique, do you run into issues?
>> >
>> > No. All "simpler" attempts were successful, unfortunately. Can you be
>> > a little bit more precise about what scenario you imagine? The
>> > setting "all ranks mmap a local file, and rank 0 broadcasts there" is
>> > successful.
>> >
>> >> 2. How large is the data in the collective (input and output)? Is
>> >> in_place used? I'm guessing it's large enough that the BTL tries to
>> >> work with the user buffer.
>> >
>> > MPI_IN_PLACE is used in reduce_scatter and allgather in the code.
>> > Collectives are with communicators of 2 nodes, and we're talking (for
>> > the smallest failing run) 8kb per node (i.e. 16kb total for an
>> > allgather).
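>> >
>> > For reference, the in-place allgather call has this shape (the buffer
>> > name and counts here are illustrative, not copied from the code):
>> >
>> >   /* each of the 2 ranks contributes its own 8 kB slice of area in place */
>> >   MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
>> >                 area, 8192, MPI_BYTE, MPI_COMM_WORLD);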
>> >
>> > E.
>> >
>> >> On Mon, Nov 10, 2014 at 9:29 AM, Emmanuel Thomé
>> >> <emmanuel.th...@gmail.com>
>> >> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> I'm stumbling on a problem related to the openib btl in
>> >>> openmpi-1.[78].*, and the (I think legitimate) use of file-backed
>> >>> mmaped areas for receiving data through MPI collective calls.
>> >>>
>> >>> A test case is attached. I've tried to make it reasonably small,
>> >>> although I recognize that it's not extra thin. The test case is a
>> >>> trimmed down version of what I witness in the context of a rather
>> >>> large program, so there is no claim of relevance of the test case
>> >>> itself. It's here just to trigger the desired misbehaviour. The test
>> >>> case contains some detailed information on what is done, and the
>> >>> experiments I did.
>> >>>
>> >>> In a nutshell, the problem is as follows.
>> >>>
>> >>>  - I do a computation, which involves MPI_Reduce_scatter and
>> >>> MPI_Allgather.
>> >>>  - I save the result to a file (collective operation).
>> >>>
>> >>> *If* I save the file using something such as:
>> >>>  fd = open("blah", ...
>> >>>  area = mmap(..., fd, )
>> >>>  MPI_Gather(..., area, ...)
>> >>> *AND* the MPI_Reduce_scatter is done with an alternative
>> >>> implementation (which I believe is correct)
>> >>> *AND* communication is done through the openib btl,
>> >>>
>> >>> then the file which gets saved is inconsistent with what is obtained
>> >>> with the normal MPI_Reduce_scatter (although the memory areas do
>> >>> coincide before the save).
>> >>>
>> >>> I tried to dig a bit in the openib internals, but all I've been able
>> >>> to witness was beyond my expertise (an RDMA read not transferring the
>> >>> expected data, but I'm too uncomfortable with this layer to say
>> >>> anything I'm sure about).
>> >>>
>> >>> Tests have been done with several Open MPI versions including 1.8.3,
>> >>> on a Debian wheezy (7.5) + OFED 2.3 cluster.
>> >>>
>> >>> It would be great if someone could tell me whether they are able to reproduce
>> >>> the bug, or tell me whether something which is done in this test case
>> >>> is illegal in any respect. I'd be glad to provide further information
>> >>> which could be of any help.
>> >>>
>> >>> Best regards,
>> >>>
>> >>> E. Thomé.