Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

2017-11-13 Thread Brice Goglin
The doc is wrong: the flags are used, but only for BY_NODESET. I actually
fixed that in git very recently.

Brice
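
A minimal sketch of the intended usage once that fix is in (this assumes the
post-fix API where get_area_memlocation honours HWLOC_MEMBIND_BYNODESET;
print_page_location is an illustrative helper, not an hwloc function):

    #include <hwloc.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Report the nodeset where the pages backing `buf` are currently
     * allocated.  HWLOC_MEMBIND_BYNODESET asks get_area_memlocation to
     * fill `set` as a nodeset rather than a cpuset. */
    static void print_page_location(hwloc_topology_t topo, const void *buf, size_t len)
    {
        hwloc_bitmap_t nodeset = hwloc_bitmap_alloc();
        if (hwloc_get_area_memlocation(topo, buf, len, nodeset,
                                       HWLOC_MEMBIND_BYNODESET) == 0) {
            char *str;
            hwloc_bitmap_asprintf(&str, nodeset);
            printf("pages at %p are allocated on nodeset %s\n", buf, str);
            free(str);
        }
        hwloc_bitmap_free(nodeset);
    }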



On 13/11/2017 07:24, Biddiscombe, John A. wrote:
> In the documentation for get_area_memlocation it says
> "If HWLOC_MEMBIND_BYNODESET is specified, set is considered a nodeset.
> Otherwise it's a cpuset."
>
> but it also says "Flags are currently unused."
>
> So where should the BY_NODESET flag be used? Does it have to be used with
> the original alloc call?
>
> thanks
>
> JB
>
> 
> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of 
> Biddiscombe, John A. [biddi...@cscs.ch]
> Sent: 13 November 2017 14:59
> To: Hardware locality user list
> Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset
>
> Brice
>
> aha. thanks. I knew I'd seen a function for that, but couldn't remember what 
> it was.
>
> Cheers
>
> JB
> 
> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of Brice 
> Goglin [brice.gog...@inria.fr]
> Sent: 13 November 2017 14:57
> To: Hardware locality user list
> Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset
>
> Use get_area_memlocation()
>
> membind() returns where the pages are *allowed* to go (anywhere)
> memlocation() returns where the pages are actually allocated.
>
> Brice
>
>
>
>
> On 13/11/2017 06:52, Biddiscombe, John A. wrote:
>> Thank you to you both.
>>
>> I modified the allocator to allocate one large block using hwloc_alloc and
>> then use one thread per NUMA domain to touch each page according to the
>> tiling pattern. Unfortunately, I hadn't appreciated that
>> hwloc_get_area_membind_nodeset now always returns the full machine NUMA
>> mask, and not the NUMA domain that touched the page (I guess it only gives
>> the expected answer when set_area_membind is used first).
>>
>> I had hoped to use a dynamic query of the pages (using the first page of a
>> given tile) to schedule each task that operates on that tile onto the NUMA
>> node that touched it.
>>
>> I can work around this by using a matrix offset calculation to get the NUMA
>> node, but if there's a way of querying the page directly then please let me
>> know.
>>
>> Thanks
>>
>> JB
>> 
>> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of 
>> Samuel Thibault [samuel.thiba...@inria.fr]
>> Sent: 12 November 2017 10:48
>> To: Hardware locality user list
>> Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset
>>
>> Brice Goglin, on Sun 12 Nov 2017 05:19:37 +0100, wrote:
>>> That's likely what's happening. Each set_area() may be creating a new
>>> "virtual memory area". The kernel tries to merge them with neighbors if
>>> they go to the same NUMA node. Otherwise it creates a new VMA.
>> Mmmm, that sucks. Ideally we'd have a way to ask the kernel not to
>> strictly bind the memory, but just to allocate on a given memory
>> node, and just hope that the allocation will not go away (e.g. due to
>> swapping), which thus doesn't need a VMA to record the information. As
>> you describe below, first-touch achieves that but it's not necessarily
>> so convenient.
>>
>>> I can't find the exact limit but it's something like 64k so I guess
>>> you're exhausting that.
>> It's the sysctl vm.max_map_count.
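
For reference, reading that limit from a program is just a matter of parsing
/proc/sys/vm/max_map_count (Linux-specific sketch; the helper name is
illustrative):

    #include <stdio.h>

    /* Read the per-process VMA/mapping limit discussed above (Linux only). */
    static long read_max_map_count(void)
    {
        long count = -1;
        FILE *f = fopen("/proc/sys/vm/max_map_count", "r");
        if (f) {
            if (fscanf(f, "%ld", &count) != 1)
                count = -1;
            fclose(f);
        }
        return count;
    }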
>>
>>> Question 2: Is there a better way of achieving the result I'm looking for
>>> (such as a call to membind with a stride of some kind, to say "put N pages
>>> in a row on each domain in alternation")?
>>>
>>>
>>> Unfortunately, the interleave policy doesn't have a stride argument. It's
>>> one page on node 0, one page on node 1, etc.
>>>
>>> The only idea I have is to use the first-touch policy: make sure your
>>> buffer isn't in physical memory yet, and have a thread on node 0 read the
>>> "0" pages, and another thread on node 1 read the "1" pages.
>> Or "next-touch" if that was to ever get merged into mainline Linux :)
>>
>> Samuel
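
A minimal sketch of that first-touch scheme, assuming one large buffer
obtained from hwloc_alloc() and a hypothetical page_belongs_to_node()
helper that encodes the tiling pattern:

    #include <hwloc.h>
    #include <stddef.h>

    /* Bind the calling thread to the cores of NUMA object `node`, then
     * touch the pages of `buf` that the tiling pattern assigns to that
     * node.  The kernel's first-touch policy allocates each touched page
     * locally, and since set_area_membind() is never called, no extra
     * VMAs are created. */
    static void first_touch_for_node(hwloc_topology_t topo, hwloc_obj_t node,
                                     char *buf, size_t len, size_t pagesize,
                                     int (*page_belongs_to_node)(size_t page,
                                                                 unsigned node_index))
    {
        hwloc_set_cpubind(topo, node->cpuset, HWLOC_CPUBIND_THREAD);
        for (size_t off = 0; off < len; off += pagesize)
            if (page_belongs_to_node(off / pagesize, node->logical_index))
                buf[off] = 0;   /* first touch: the page lands on this node */
    }

Where each page actually ended up can then be checked with
get_area_memlocation() as discussed earlier in the thread.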
_______________________________________________
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users
