Re: [OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?

2016-06-27 Thread Brice Goglin
mlock() and mlockall() only guarantee that pages won't be swapped out to
the disk. However, they don't prevent virtual pages from moving to other
physical pages (for instance during migration between NUMA nodes), which
breaks memory registration. At least this was true a couple of years ago; I
haven't checked recently, but I would be surprised if those semantics had changed.

Brice



Le 27/06/2016 21:17, Audet, Martin a écrit :
> Thanks Jeff and Alex for your answers and comments.
>  
> mlockall(), especially with the MCL_FUTURE argument, is indeed interesting.
>  
> Thanks Jeff for your clarification of what memory registration really
> means (e.g. locking and telling the network stack the virtual to
> physical mapping).
>  
> Also concerning the ummunotify kernel module, I would like to point
> out that while the GitHub bug report linked earlier suggests it is
> problematic, the top-level Open MPI README file still recommends it.
> Should the README file be updated?
>  
> Regards,
>  
> Martin Audet
>  
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/06/29549.php



Re: [OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?

2016-06-27 Thread Audet, Martin
Thanks Jeff and Alex for your answers and comments.

mlockall(), especially with the MCL_FUTURE argument, is indeed interesting.

Thanks Jeff for your clarification of what memory registration really means 
(e.g. locking and telling the network stack the virtual to physical mapping).

Also concerning the ummunotify kernel module, I would like to point out that 
while the GitHub bug report linked earlier suggests it is problematic, the 
top-level Open MPI README file still recommends it. Should the README file be 
updated?

Regards,

Martin Audet



Re: [OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?

2016-06-21 Thread Jeff Squyres (jsquyres)
> On Jun 20, 2016, at 4:15 PM, Audet, Martin wrote:
> 
> But now, since we have to live with memory registration issues, what changes 
> should be made to a standard Linux distro so that Open MPI can best use a 
> recent Mellanox InfiniBand network?
>  
> I guess that installing the ummunotify kernel module is a good idea?

There may be some bit rot in our ummunotify support in Open MPI.  Ah yes, here 
it is:

https://github.com/open-mpi/ompi/issues/429

We haven't dug into the problem because ummunotify isn't upstream, and no one 
has apparently been using it.  If someone had some cycles to look into it, that 
would be great.

> Maybe removing the limit on the “max locked memory” (ulimit -l) is also a 
> good idea?

That's what most OS-bypass vendors recommend.

> Besides that, I guess that installing the latest OFED (to have the latest 
> middleware), instead of using the default one coming with the Linux distro, is 
> a good idea?

There's different religion surrounding that -- I believe there was some good 
discussion about that topic on the user's list recently.

> Also, is the XPMEM kernel module, which allows more efficient intra-node 
> transfer of large messages, worth installing now that kernels include the CMA API?

Yes, XPMEM == goodness.  I know that Vader supports both XPMEM and CMA; I don't 
know offhand the tradeoffs between the two.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?

2016-06-21 Thread Jeff Squyres (jsquyres)
On Jun 20, 2016, at 4:27 PM, Alex A. Granovsky  wrote:
> 
> Would the use of mlockall be helpful for this approach?

That's an interesting idea; I didn't know about the existence of 
mlockall(MCL_FUTURE).

It has a few drawbacks, of course (e.g., processes can't shrink), but they 
might be acceptable tradeoffs for typical HPC/MPI application scenarios.

Keep in mind that locking memory is only one half of the registration process: 
the other half is notifying the network stack so that it can record/track the 
virtual memory-->physical memory mapping.  Meaning: Open MPI will still need 
its registration cache infrastructure -- but it *might* be able to be slightly 
simpler because the eviction mechanisms will never be invoked.

Also keep in mind that the memory mechanisms -- regardless of whether it's 
mlockall(MCL_FUTURE) or the newly-revamped "patcher" system in the upcoming 
Open MPI v2.0.0 -- are not in the critical performance path (e.g., that stuff 
doesn't happen during a call to MPI_SEND).  The part that *is* in the critical 
performance path is the registration cache -- i.e., the part where Open MPI 
asks "is this buffer/memory registered [with the network stack]?"  That part is 
designed to be fast and, at least at the moment, will still need to be there.

If there's some kind of equivalent to mlockall(MCL_FUTURE) that *also* 
transparently registers all new memory with the relevant underlying network 
stack(s) and contexts, that would be neat.  

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?

2016-06-20 Thread Alex A. Granovsky
Would the use of mlockall be helpful for this approach?

From: Audet, Martin 
Sent: Monday, June 20, 2016 11:15 PM
To: us...@open-mpi.org 
Subject: Re: [OMPI users] Avoiding the memory registration costs by having 
memory always registered, is it possible with Linux ?

Thanks Jeff for your answer.

It is sad that the approach I mentioned, of having all memory registered for 
user processes on cluster nodes, didn’t become more popular.

I still believe that such an approach would shorten the executed code path in 
MPI libraries, reduce message latency, increase the communication/computation 
overlap potential, and allow communication to progress more naturally.

But now, since we have to live with memory registration issues, what changes 
should be made to a standard Linux distro so that Open MPI can best use a recent 
Mellanox InfiniBand network?

I guess that installing the ummunotify kernel module is a good idea?

Maybe removing the limit on the “max locked memory” (ulimit -l) is also a good 
idea?

Besides that, I guess that installing the latest OFED (to have the latest 
middleware), instead of using the default one coming with the Linux distro, is a 
good idea?

Also, is the XPMEM kernel module, which allows more efficient intra-node 
transfer of large messages, worth installing now that kernels include the CMA API?

Thanks,

Martin Audet




___
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/06/29490.php

Re: [OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?

2016-06-20 Thread Audet, Martin
Thanks Jeff for your answer.

It is sad that the approach I mentioned, of having all memory registered for 
user processes on cluster nodes, didn't become more popular.

I still believe that such an approach would shorten the executed code path in 
MPI libraries, reduce message latency, increase the communication/computation 
overlap potential, and allow communication to progress more naturally.

But now, since we have to live with memory registration issues, what changes 
should be made to a standard Linux distro so that Open MPI can best use a recent 
Mellanox InfiniBand network?

I guess that installing the ummunotify kernel module is a good idea?

Maybe removing the limit on the "max locked memory" (ulimit -l) is also a good 
idea?

Besides that, I guess that installing the latest OFED (to have the latest 
middleware), instead of using the default one coming with the Linux distro, is a 
good idea?

Also, is the XPMEM kernel module, which allows more efficient intra-node 
transfer of large messages, worth installing now that kernels include the CMA API?

Thanks,

Martin Audet



Re: [OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?

2016-06-18 Thread Jeff Squyres (jsquyres)
Greetings Martin.

Such approaches have been discussed in the past.  Indeed, I'm pretty sure that 
I've heard of some non-commodity systems / network stacks that do this kind of 
thing.

Such approaches have not evolved in the commodity Linux space, however.  This 
kind of support would need better hooks than what exist today; new hooks would 
be needed for integration between the memory allocator (e.g., all the 
allocation methods in glibc) and the underlying network stack(s).  E.g.:

- hook when memory is attached to the process
- hook when memory is detached from the process
- allow multiple hooks to co-exist in the same userspace process simultaneously

Ultimately, memory attach/detach is controlled by the kernel.  My $0.02 is that 
an ultimate solution would need to have some kind of kernel aspect to it.

In the past, Linus has been (probably rightfully) resistant to adding such 
solutions for the general case, because these problems are really fairly 
specific to OS-bypass network stacks (i.e., the drivers/infiniband area in the 
kernel).  His prior response when this topic came up back in 2009 was basically 
"fix your own network stack."

That being said, if someone would like to advance work in this area -- 
particularly to include a solution in the drivers/infiniband section of the 
Linux kernel -- I think that would be great.



> On Jun 16, 2016, at 3:59 PM, Audet, Martin wrote:
> 
> Hi,
>  
> After reading a little of the FAQ on the methods used by Open MPI to deal 
> with memory registration (or pinning) with InfiniBand adapters, it seems that 
> we could avoid all the overhead and complexity of memory 
> registration/deregistration, registration-cache access and update, and memory 
> management (ummunotify), in addition to allowing better overlap of 
> communications with computations (we could let the communication hardware do 
> its job independently, without resorting to 
> registration/transfer/deregistration pipelines), by simply having all user 
> process memory registered all the time.
>  
> Of course a configuration like that is not appropriate in a general setting 
> (e.g., a desktop environment), as it would make swapping almost impossible.
>  
> But in the context of an HPC node, where the processes are not supposed to 
> swap and the OS does not overcommit memory, not being able to swap doesn’t 
> appear to be a problem.
>  
> Moreover, since the maximum total memory used per process is often predefined 
> at application start as a resource specified to the queuing system, the OS 
> could easily keep a defined amount of extra memory for its own needs instead 
> of swapping out user process memory.
>  
> I guess that specialized (non-Linux) compute node OSes do this.
>  
> But is it possible, and does it make sense, with Linux?
>  
> Thanks,
>  
> Martin Audet
>  
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/06/29470.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?

2016-06-16 Thread Audet, Martin
Hi,

After reading a little of the FAQ on the methods used by Open MPI to deal with 
memory registration (or pinning) with InfiniBand adapters, it seems that we 
could avoid all the overhead and complexity of memory 
registration/deregistration, registration-cache access and update, and memory 
management (ummunotify), in addition to allowing better overlap of 
communications with computations (we could let the communication hardware do 
its job independently, without resorting to 
registration/transfer/deregistration pipelines), by simply having all user 
process memory registered all the time.

Of course a configuration like that is not appropriate in a general setting 
(e.g., a desktop environment), as it would make swapping almost impossible.

But in the context of an HPC node, where the processes are not supposed to swap 
and the OS does not overcommit memory, not being able to swap doesn't appear to 
be a problem.

Moreover, since the maximum total memory used per process is often predefined at 
application start as a resource specified to the queuing system, the OS could 
easily keep a defined amount of extra memory for its own needs instead of 
swapping out user process memory.

I guess that specialized (non-Linux) compute node OSes do this.

But is it possible, and does it make sense, with Linux?

Thanks,

Martin Audet