Re: [OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?
Brice: good to know.

Martin: yes, we should update the README. Thanks for the heads-up.

> On Jun 27, 2016, at 3:24 PM, Brice Goglin wrote:
>
> mlock() and mlockall() only guarantee that pages won't be swapped out to
> disk. However, they don't prevent virtual pages from moving to other
> physical pages (for instance during migration between NUMA nodes), which
> breaks memory registration. At least this was true a couple of years ago;
> I didn't check recently, but I would be surprised if that semantics changed.
>
> Brice
>
> Le 27/06/2016 21:17, Audet, Martin a écrit :
>> Thanks Jeff and Alex for your answers and comments.
>>
>> mlockall(), especially with the MCL_FUTURE argument, is indeed interesting.
>>
>> Thanks Jeff for your clarification of what memory registration really
>> means (e.g., locking and telling the network stack the virtual-to-physical
>> mapping).
>>
>> Also, concerning the ummunotify kernel module, I would like to point out
>> that while the linked GitHub bug report suggests it is problematic, the
>> top-level Open MPI README file still recommends it. Should the README
>> file be updated?
>>
>> Regards,
>>
>> Martin Audet
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/06/29549.php

___
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/06/29550.php

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?
mlock() and mlockall() only guarantee that pages won't be swapped out to
disk. However, they don't prevent virtual pages from moving to other physical
pages (for instance during migration between NUMA nodes), which breaks memory
registration. At least this was true a couple of years ago; I didn't check
recently, but I would be surprised if that semantics changed.

Brice

Le 27/06/2016 21:17, Audet, Martin a écrit :
> Thanks Jeff and Alex for your answers and comments.
>
> mlockall(), especially with the MCL_FUTURE argument, is indeed interesting.
>
> Thanks Jeff for your clarification of what memory registration really means
> (e.g., locking and telling the network stack the virtual-to-physical
> mapping).
>
> Also, concerning the ummunotify kernel module, I would like to point out
> that while the linked GitHub bug report suggests it is problematic, the
> top-level Open MPI README file still recommends it. Should the README file
> be updated?
>
> Regards,
>
> Martin Audet
Re: [OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?
Thanks Jeff and Alex for your answers and comments.

mlockall(), especially with the MCL_FUTURE argument, is indeed interesting.

Thanks Jeff for your clarification of what memory registration really means
(e.g., locking and telling the network stack the virtual-to-physical mapping).

Also, concerning the ummunotify kernel module, I would like to point out that
while the linked GitHub bug report suggests it is problematic, the top-level
Open MPI README file still recommends it. Should the README file be updated?

Regards,

Martin Audet
Re: [OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?
> On Jun 20, 2016, at 4:15 PM, Audet, Martin wrote:
>
> But now, since we have to live with memory registration issues, what changes
> should be made to a standard Linux distro so that Open MPI can best use a
> recent Mellanox InfiniBand network?
>
> I guess that installing the ummunotify kernel module is a good idea?

There may be some bit rot in our ummunotify support in Open MPI. Ah yes, here
it is: https://github.com/open-mpi/ompi/issues/429

We haven't dug into the problem because ummunotify isn't upstream, and no one
has apparently been using it. If someone had some cycles to look into it,
that would be great.

> Maybe removing the limits on the "max locked memory" (ulimit -l) is also
> good?

That's what most OS-bypass vendors recommend.

> Besides that, I guess that installing the latest OFED (to have the latest
> middleware) instead of using the default one coming with the Linux distro
> is a good idea?

There's different religion surrounding that -- I believe there was some good
discussion about that topic on the users list recently.

> Also, is the XPMEM kernel module (for more efficient intra-node transfer of
> large messages) worth installing, since kernels now include the CMA API?

Yes, XPMEM == goodness. I know that Vader supports both XPMEM and CMA; I
don't know offhand the tradeoffs between the two.
Re: [OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?
> On Jun 20, 2016, at 4:27 PM, Alex A. Granovsky wrote:
>
> Would the use of mlockall be helpful for this approach?

That's an interesting idea; I didn't know about the existence of
mlockall(MCL_FUTURE). It has a few drawbacks, of course (e.g., processes
can't shrink), but those might be acceptable tradeoffs for typical HPC/MPI
application scenarios.

Keep in mind that locking memory is only one half of the registration
process: the other half is notifying the network stack so that it can
record/track the virtual-memory-to-physical-memory mapping. Meaning: Open
MPI will still need its registration cache infrastructure -- but it *might*
be able to be slightly simpler, because the eviction mechanisms would never
be invoked.

Also keep in mind that the memory mechanisms -- regardless of whether it's
mlockall(MCL_FUTURE) or the newly-revamped "patcher" system in the upcoming
Open MPI v2.0.0 -- are not in the critical performance path (e.g., that
stuff doesn't happen during a call to MPI_SEND). The part that *is* in the
critical performance path is the registration cache -- i.e., the part where
Open MPI asks "is this buffer/memory registered [with the network stack]?"
That part is designed to be fast and, at least at the moment, will still
need to be there.

If there were some kind of equivalent to mlockall(MCL_FUTURE) that *also*
transparently registered all new memory with the relevant underlying network
stack(s) and contexts, that would be neat.
Re: [OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?
Would the use of mlockall be helpful for this approach?

From: Audet, Martin
Sent: Monday, June 20, 2016 11:15 PM
To: mailto:us...@open-mpi.org
Subject: Re: [OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?

Thanks Jeff for your answer.

It is sad that the approach I mentioned, of having all memory registered for
user processes on cluster nodes, didn't become more popular. I still believe
that such an approach would shorten the executed code path in MPI libraries,
reduce message latency, increase the communication/computation overlap
potential, and allow communication to progress more naturally.

But now, since we have to live with memory registration issues, what changes
should be made to a standard Linux distro so that Open MPI can best use a
recent Mellanox InfiniBand network?

I guess that installing the ummunotify kernel module is a good idea?

Maybe removing the limits on the "max locked memory" (ulimit -l) is also
good?

Besides that, I guess that installing the latest OFED (to have the latest
middleware) instead of using the default one coming with the Linux distro is
a good idea?

Also, is the XPMEM kernel module (for more efficient intra-node transfer of
large messages) worth installing, since kernels now include the CMA API?

Thanks,

Martin Audet

___
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/06/29490.php
Re: [OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?
Thanks Jeff for your answer.

It is sad that the approach I mentioned, of having all memory registered for
user processes on cluster nodes, didn't become more popular. I still believe
that such an approach would shorten the executed code path in MPI libraries,
reduce message latency, increase the communication/computation overlap
potential, and allow communication to progress more naturally.

But now, since we have to live with memory registration issues, what changes
should be made to a standard Linux distro so that Open MPI can best use a
recent Mellanox InfiniBand network?

I guess that installing the ummunotify kernel module is a good idea?

Maybe removing the limits on the "max locked memory" (ulimit -l) is also
good?

Besides that, I guess that installing the latest OFED (to have the latest
middleware) instead of using the default one coming with the Linux distro is
a good idea?

Also, is the XPMEM kernel module (for more efficient intra-node transfer of
large messages) worth installing, since kernels now include the CMA API?

Thanks,

Martin Audet
Re: [OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?
Greetings Martin.

Such approaches have been discussed in the past. Indeed, I'm pretty sure
that I've heard of some non-commodity systems / network stacks that do this
kind of thing. Such approaches have not evolved in the commodity Linux
space, however.

This kind of support would need better hooks than what exist today; new
hooks would be needed for integration between the memory allocator (e.g.,
all the allocation methods in glibc) and the underlying network stack(s).
E.g.:

- hook when memory is attached to the process
- hook when memory is detached from the process
- allow multiple hooks to co-exist in the same userspace process
  simultaneously

Ultimately, memory attach/detach is controlled by the kernel. My $0.02 is
that an ultimate solution would need to have some kind of kernel aspect to
it. In the past, Linus has been (probably rightfully) resistant to adding
such solutions for the general case, because these problems are really
fairly specific to OS-bypass network stacks (i.e., the drivers/infiniband
area in the kernel). His prior response when this topic came up back in 2009
was basically "fix your own network stack."

That being said, if someone would like to advance work in this area --
particularly to include a solution in the drivers/infiniband section of the
Linux kernel -- I think that would be great.
> On Jun 16, 2016, at 3:59 PM, Audet, Martin wrote:
>
> Hi,
>
> After reading a little of the FAQ on the methods used by Open MPI to deal
> with memory registration (or pinning) with InfiniBand adapters, it seems
> that we could avoid all the overhead and complexity of memory
> registration/deregistration, registration-cache access and update, and
> memory management (ummunotify) -- in addition to allowing a better overlap
> of communication with computation (we could let the communication hardware
> do its job independently, without resorting to
> registration/transfer/deregistration pipelines) -- by simply having all
> user process memory registered all the time.
>
> Of course, a configuration like that is not appropriate in a general
> setting (e.g., a desktop environment), as it would make swapping almost
> impossible.
>
> But in the context of an HPC node, where the processes are not supposed to
> swap and the OS does not overcommit memory, not being able to swap doesn't
> appear to be a problem.
>
> Moreover, since the maximal total memory used per process is often
> predefined at application start as a resource specified to the queuing
> system, the OS could easily keep a defined amount of extra memory for its
> own needs instead of swapping out user process memory.
>
> I guess that specialized (non-Linux) compute-node OSes do this.
>
> But is it possible, and does it make sense, with Linux?
>
> Thanks,
>
> Martin Audet
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2016/06/29470.php
[OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?
Hi,

After reading a little of the FAQ on the methods used by Open MPI to deal
with memory registration (or pinning) with InfiniBand adapters, it seems
that we could avoid all the overhead and complexity of memory
registration/deregistration, registration-cache access and update, and
memory management (ummunotify) -- in addition to allowing a better overlap
of communication with computation (we could let the communication hardware
do its job independently, without resorting to
registration/transfer/deregistration pipelines) -- by simply having all user
process memory registered all the time.

Of course, a configuration like that is not appropriate in a general setting
(e.g., a desktop environment), as it would make swapping almost impossible.

But in the context of an HPC node, where the processes are not supposed to
swap and the OS does not overcommit memory, not being able to swap doesn't
appear to be a problem.

Moreover, since the maximal total memory used per process is often
predefined at application start as a resource specified to the queuing
system, the OS could easily keep a defined amount of extra memory for its
own needs instead of swapping out user process memory.

I guess that specialized (non-Linux) compute-node OSes do this.

But is it possible, and does it make sense, with Linux?

Thanks,

Martin Audet