On May 28, 2012, at 3:48 PM, François Guertin wrote:

> I am trying to bind both the memory and the processes on our compute cluster
> nodes, but only the process binding works. How can I also make the memory be
> allocated on the same NUMA node as the one where the process is bound? I
> tried the option "--mca hwloc_base_mem_alloc_policy local_only" without
> any luck.
> [snip]

Thanks for your very detailed message -- it made it possible to completely 
understand your question, and (hopefully) answer it properly.  :-)

I think the issue here is that the help message for hwloc_base_mem_alloc_policy 
isn't quite worded properly:

>> MCA hwloc: parameter "hwloc_base_mem_alloc_policy" (current value:
>> <none>, data source: default value)
>>                          Policy that determines how general memory
>> allocations are bound after MPI_INIT.  A value of "none" means that no
>> memory policy is applied.  A value of "local_only" means that all
>> memory allocations will be restricted to the local NUMA node where
>> each process is placed.  Note that operating system paging policies
>> are unaffected by this setting.  For example, if "local_only" is used
>> and local NUMA node memory is exhausted, a new memory allocation may
>> cause paging.

At issue is the fact that I probably should not have used the word "bound" in 
the first sentence, and should have clarified that memory is *not* bound. 

Specifically, when you set hwloc_base_mem_alloc_policy to "local_only", that 
only sets the policy for where newly malloced memory is placed.  Even more 
specifically: it does *not* bind the memory, meaning that if your process' 
memory is swapped out, it could get swapped back in at a new location (yoinks!).
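
For example, combining what you already had with process binding would look 
something like the line below (./my_app is just a placeholder, and the exact 
spelling of the process-binding option depends on your Open MPI version):

  mpirun -np 4 --bind-to-core \
      --mca hwloc_base_mem_alloc_policy local_only ./my_app

The process binding is enforced; the memory part only influences where new 
allocations are physically placed.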

That being said, most HPC apps don't swap, so it's *usually* not an issue.  
But, of course, after you malloc memory (which will be physically located on 
your local NUMA node), you could bind it, too, if you want.
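
If you do want to go that route, here's a minimal, untested sketch of what it 
could look like using hwloc directly.  The function name alloc_and_bind is 
just for illustration -- this is not something Open MPI does for you, and real 
code should check every return value:

  #include <stdlib.h>
  #include <hwloc.h>

  /* Allocate len bytes, then bind that range to the memory near the
   * NUMA node(s) this process is currently bound to. */
  void *alloc_and_bind(size_t len)
  {
      hwloc_topology_t topo;
      hwloc_bitmap_t set;
      void *buf = malloc(len);
      if (NULL == buf) return NULL;

      hwloc_topology_init(&topo);
      hwloc_topology_load(topo);

      /* Where is this process currently bound? */
      set = hwloc_bitmap_alloc();
      hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS);

      /* Bind the malloced range to the corresponding local memory.
       * Add HWLOC_MEMBIND_MIGRATE to the flags if the pages may
       * already have been touched and you want them moved. */
      hwloc_set_area_membind(topo, buf, len, set, HWLOC_MEMBIND_BIND, 0);

      hwloc_bitmap_free(set);
      hwloc_topology_destroy(topo);
      return buf;
  }

hwloc also has hwloc_alloc_membind(), which allocates and binds in one shot, 
if you'd rather not malloc first and bind afterwards.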

Open MPI doesn't bind user-allocated memory (except possibly buffers that are 
passed to functions like MPI_SEND and MPI_RECV) because that would mean we 
would have to intercept calls like malloc, calloc, etc.  And we 
don't really want to be in that business.

(disclaimer: we sorta do intercept malloc, calloc, etc. in some cases -- but we 
really don't want to, and don't do it in all cases.  I can explain more if you 
care)

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

