Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

2017-11-15 Thread Brice Goglin
A random guess would be that numactl-devel isn't installed on the
system. It's required for numa-related syscalls (there's a summary at
the end of configure saying whether libnuma was found, or you can ldd
libhwloc.so and see if libnuma is listed). This will go away in 2.0
because we don't use libnuma at all anymore.
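For illustration, a minimal sketch (assuming hwloc >= 1.11.3, where the
get_area_memlocation support field exists) of checking at runtime whether the
loaded hwloc/OS backend actually implements these memory-binding calls, instead
of waiting for a "Function not implemented" error:

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topology;
    const struct hwloc_topology_support *support;

    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    /* The support flags report which binding calls the OS backend implements,
       which is more reliable than probing for ENOSYS at runtime. */
    support = hwloc_topology_get_support(topology);
    printf("set_area_membind supported:     %d\n", support->membind->set_area_membind);
    printf("get_area_memlocation supported: %d\n", support->membind->get_area_memlocation);

    hwloc_topology_destroy(topology);
    return 0;
}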

Brice



Le 15/11/2017 09:51, Biddiscombe, John A. a écrit :
> Running my test on another machine (fedora 7.2) I get a
> hwloc_get_area_memlocation failure
> with strerror = "Function not implemented".
>
> Does this mean that the OS has not implemented it (I'm using hwloc version
> 1.11.8 - on the primary test machine I used 1.11.17)? Am I doomed, or
> will things magically work if I upgrade to hwloc 2.0, etc.?
>
> Thanks
>
> JB
>
>
> -Original Message-
> From: hwloc-users [mailto:hwloc-users-boun...@lists.open-mpi.org] On Behalf 
> Of Biddiscombe, John A.
> Sent: 13 November 2017 15:37
> To: Hardware locality user list 
> Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset
>
> It's working and I'm seeing the binding pattern I hoped for.
>
> Thanks again
>
> JB
>
> 
> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of Brice 
> Goglin [brice.gog...@inria.fr]
> Sent: 13 November 2017 15:32
> To: Hardware locality user list
> Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset
>
> The doc is wrong: the flags argument is used, but only for BY_NODESET. I
> actually fixed that in git very recently.
>
> Brice
>
>
>
> Le 13/11/2017 07:24, Biddiscombe, John A. a écrit :
>> In the documentation for get_area_memlocation it says "If 
>> HWLOC_MEMBIND_BYNODESET is specified, set is considered a nodeset. Otherwise 
>> it's a cpuset."
>>
>> but it also says "Flags are currently unused."
>>
>> so where should the BY_NODESET policy be used? Does it have to be used with 
>> the original alloc call?
>>
>> thanks
>>
>> JB
>>
>> ____
>> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf 
>> of Biddiscombe, John A. [biddi...@cscs.ch]
>> Sent: 13 November 2017 14:59
>> To: Hardware locality user list
>> Subject: Re: [hwloc-users] question about 
>> hwloc_set_area_membind_nodeset
>>
>> Brice
>>
>> aha. thanks. I knew I'd seen a function for that, but couldn't remember what 
>> it was.
>>
>> Cheers
>>
>> JB
>> 
>> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf 
>> of Brice Goglin [brice.gog...@inria.fr]
>> Sent: 13 November 2017 14:57
>> To: Hardware locality user list
>> Subject: Re: [hwloc-users] question about 
>> hwloc_set_area_membind_nodeset
>>
>> Use get_area_memlocation()
>>
>> membind() returns where the pages are *allowed* to go (anywhere)
>> memlocation() returns where the pages are actually allocated.
>>
>> Brice
>>
>>
>>
>>
>> Le 13/11/2017 06:52, Biddiscombe, John A. a écrit :
>>> Thank you to you both.
>>>
>>> I modified the allocator to allocate one large block using 
>>> hwloc_alloc and then use one thread per numa domain to  touch each 
>>> page according to the tiling pattern - unfortunately, I hadn't 
>>> appreciated that now hwloc_get_area_membind_nodeset always returns 
>>> the full machine numa mask - and not the numa domain that the page 
>>> was touched by (I guess it only gives the expected answer when 
>>> set_area_membind is used first)
>>>
>>> I had hoped to use a dynamic query of the pages (using the first one of a 
>>> given tile) to schedule each task that operates on a given tile to run on 
>>> the numa node that touched it.
>>>
>>> I can work around this by using a matrix offset calculation to get the numa 
>>> node, but if there's a way of querying the page directly - then please let 
>>> me know.
>>>
>>> Thanks
>>>
>>> JB
>>> 
>>> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf 
>>> of Samuel Thibault [samuel.thiba...@inria.fr]
>>> Sent: 12 November 2017 10:48
>>> To: Hardware locality user list
>>> Subject: Re: [hwloc-users] question about 
>>> hwloc_set_area_membind_nodeset
>>>
>>> Brice Goglin, on dim. 12 nov. 2017 05:19:37 +0100, wrote:

Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

2017-11-15 Thread Biddiscombe, John A.
Running my test on another machine (fedora 7.2) I get a
hwloc_get_area_memlocation failure
with strerror = "Function not implemented".

Does this mean that the OS has not implemented it (I'm using hwloc version
1.11.8 - on the primary test machine I used 1.11.17)? Am I doomed, or will
things magically work if I upgrade to hwloc 2.0, etc.?

Thanks

JB


-Original Message-
From: hwloc-users [mailto:hwloc-users-boun...@lists.open-mpi.org] On Behalf Of 
Biddiscombe, John A.
Sent: 13 November 2017 15:37
To: Hardware locality user list 
Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

It's working and I'm seeing the binding pattern I hoped for.

Thanks again

JB


From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of Brice 
Goglin [brice.gog...@inria.fr]
Sent: 13 November 2017 15:32
To: Hardware locality user list
Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

The doc is wrong: the flags argument is used, but only for BY_NODESET. I
actually fixed that in git very recently.

Brice



Le 13/11/2017 07:24, Biddiscombe, John A. a écrit :
> In the documentation for get_area_memlocation it says "If 
> HWLOC_MEMBIND_BYNODESET is specified, set is considered a nodeset. Otherwise 
> it's a cpuset."
>
> but it also says "Flags are currently unused."
>
> so where should the BY_NODESET policy be used? Does it have to be used with 
> the original alloc call?
>
> thanks
>
> JB
>
> 
> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf 
> of Biddiscombe, John A. [biddi...@cscs.ch]
> Sent: 13 November 2017 14:59
> To: Hardware locality user list
> Subject: Re: [hwloc-users] question about 
> hwloc_set_area_membind_nodeset
>
> Brice
>
> aha. thanks. I knew I'd seen a function for that, but couldn't remember what 
> it was.
>
> Cheers
>
> JB
> 
> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf 
> of Brice Goglin [brice.gog...@inria.fr]
> Sent: 13 November 2017 14:57
> To: Hardware locality user list
> Subject: Re: [hwloc-users] question about 
> hwloc_set_area_membind_nodeset
>
> Use get_area_memlocation()
>
> membind() returns where the pages are *allowed* to go (anywhere)
> memlocation() returns where the pages are actually allocated.
>
> Brice
>
>
>
>
> Le 13/11/2017 06:52, Biddiscombe, John A. a écrit :
>> Thank you to you both.
>>
>> I modified the allocator to allocate one large block using 
>> hwloc_alloc and then use one thread per numa domain to  touch each 
>> page according to the tiling pattern - unfortunately, I hadn't 
>> appreciated that now hwloc_get_area_membind_nodeset always returns 
>> the full machine numa mask - and not the numa domain that the page 
>> was touched by (I guess it only gives the expected answer when 
>> set_area_membind is used first)
>>
>> I had hoped to use a dynamic query of the pages (using the first one of a 
>> given tile) to schedule each task that operates on a given tile to run on 
>> the numa node that touched it.
>>
>> I can work around this by using a matrix offset calculation to get the numa 
>> node, but if there's a way of querying the page directly - then please let 
>> me know.
>>
>> Thanks
>>
>> JB
>> 
>> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf 
>> of Samuel Thibault [samuel.thiba...@inria.fr]
>> Sent: 12 November 2017 10:48
>> To: Hardware locality user list
>> Subject: Re: [hwloc-users] question about 
>> hwloc_set_area_membind_nodeset
>>
>> Brice Goglin, on dim. 12 nov. 2017 05:19:37 +0100, wrote:
>>> That's likely what's happening. Each set_area() may be creating a 
>>> new "virtual memory area". The kernel tries to merge them with 
>>> neighbors if they go to the same NUMA node. Otherwise it creates a new VMA.
>> Mmmm, that sucks. Ideally we'd have a way to ask the kernel not to 
>> strictly bind the memory, but just to allocate on a given memory 
>> node, and just hope that the allocation will not go away (e.g. due to 
>> swapping), which thus doesn't need a VMA to record the information. 
>> As you describe below, first-touch achieves that but it's not 
>> necessarily so convenient.
>>
>>> I can't find the exact limit but it's something like 64k so I guess 
>>> you're exhausting that.
>> It's sysctl vm.max_map_count
>>

Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

2017-11-13 Thread Biddiscombe, John A.
It's working and I'm seeing the binding pattern I hoped for.

Thanks again

JB


From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of Brice 
Goglin [brice.gog...@inria.fr]
Sent: 13 November 2017 15:32
To: Hardware locality user list
Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

The doc is wrong: the flags argument is used, but only for BY_NODESET. I
actually fixed that in git very recently.

Brice



Le 13/11/2017 07:24, Biddiscombe, John A. a écrit :
> In the documentation for get_area_memlocation it says
> "If HWLOC_MEMBIND_BYNODESET is specified, set is considered a nodeset. 
> Otherwise it's a cpuset."
>
> but it also says "Flags are currently unused."
>
> so where should the BY_NODESET policy be used? Does it have to be used with 
> the original alloc call?
>
> thanks
>
> JB
>
> 
> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of 
> Biddiscombe, John A. [biddi...@cscs.ch]
> Sent: 13 November 2017 14:59
> To: Hardware locality user list
> Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset
>
> Brice
>
> aha. thanks. I knew I'd seen a function for that, but couldn't remember what 
> it was.
>
> Cheers
>
> JB
> 
> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of Brice 
> Goglin [brice.gog...@inria.fr]
> Sent: 13 November 2017 14:57
> To: Hardware locality user list
> Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset
>
> Use get_area_memlocation()
>
> membind() returns where the pages are *allowed* to go (anywhere)
> memlocation() returns where the pages are actually allocated.
>
> Brice
>
>
>
>
> Le 13/11/2017 06:52, Biddiscombe, John A. a écrit :
>> Thank you to you both.
>>
>> I modified the allocator to allocate one large block using hwloc_alloc and 
>> then use one thread per numa domain to  touch each page according to the 
>> tiling pattern - unfortunately, I hadn't appreciated that now
>> hwloc_get_area_membind_nodeset
>> always returns the full machine numa mask - and not the numa domain that the 
>> page was touched by (I guess it only gives the expected answer when 
>> set_area_membind is used first)
>>
>> I had hoped to use a dynamic query of the pages (using the first one of a 
>> given tile) to schedule each task that operates on a given tile to run on 
>> the numa node that touched it.
>>
>> I can work around this by using a matrix offset calculation to get the numa 
>> node, but if there's a way of querying the page directly - then please let 
>> me know.
>>
>> Thanks
>>
>> JB
>> 
>> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of 
>> Samuel Thibault [samuel.thiba...@inria.fr]
>> Sent: 12 November 2017 10:48
>> To: Hardware locality user list
>> Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset
>>
>> Brice Goglin, on dim. 12 nov. 2017 05:19:37 +0100, wrote:
>>> That's likely what's happening. Each set_area() may be creating a new 
>>> "virtual
>>> memory area". The kernel tries to merge them with neighbors if they go to 
>>> the
>>> same NUMA node. Otherwise it creates a new VMA.
>> Mmmm, that sucks. Ideally we'd have a way to ask the kernel not to
>> strictly bind the memory, but just to allocate on a given memory
>> node, and just hope that the allocation will not go away (e.g. due to
>> swapping), which thus doesn't need a VMA to record the information. As
>> you describe below, first-touch achieves that but it's not necessarily
>> so convenient.
>>
>>> I can't find the exact limit but it's something like 64k so I guess
>>> you're exhausting that.
>> It's sysctl vm.max_map_count
>>
>>> Question 2 : Is there a better way of achieving the result I'm looking 
>>> for
>>> (such as a call to membind with a stride of some kind to say put N 
>>> pages in
>>> a row on each domain in alternation).
>>>
>>>
>>> Unfortunately, the interleave policy doesn't have a stride argument. It's 
>>> one
>>> page on node 0, one page on node 1, etc.
>>>
>>> The only idea I have is to use the first-touch policy: Make sure your buffer
>>> isn't in physical memory yet, and have a thread on node 0 read the "0" pages,
>>> and another thread on node 1 read the "1" page.

Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

2017-11-13 Thread Brice Goglin
The doc is wrong: the flags argument is used, but only for BY_NODESET. I
actually fixed that in git very recently.
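To make that concrete, a minimal sketch (assuming hwloc >= 1.11.3, where
hwloc_get_area_memlocation and HWLOC_MEMBIND_BYNODESET exist) that first-touches
a buffer and then asks where its pages actually ended up, with the result
returned as a nodeset:

#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    size_t len = 1 << 20;
    char *buf = malloc(len);
    memset(buf, 0, len);                     /* first-touch the pages */

    /* With HWLOC_MEMBIND_BYNODESET the output bitmap is a nodeset,
       not a cpuset. */
    hwloc_bitmap_t nodeset = hwloc_bitmap_alloc();
    if (hwloc_get_area_memlocation(topo, buf, len, nodeset,
                                   HWLOC_MEMBIND_BYNODESET) == 0) {
        char *s;
        hwloc_bitmap_asprintf(&s, nodeset);
        printf("pages are currently on nodeset %s\n", s);
        free(s);
    } else {
        perror("hwloc_get_area_memlocation");
    }

    hwloc_bitmap_free(nodeset);
    free(buf);
    hwloc_topology_destroy(topo);
    return 0;
}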

Brice



Le 13/11/2017 07:24, Biddiscombe, John A. a écrit :
> In the documentation for get_area_memlocation it says
> "If HWLOC_MEMBIND_BYNODESET is specified, set is considered a nodeset. 
> Otherwise it's a cpuset."
>
> but it also says "Flags are currently unused."
>
> so where should the BY_NODESET policy be used? Does it have to be used with 
> the original alloc call?
>
> thanks
>
> JB
>
> 
> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of 
> Biddiscombe, John A. [biddi...@cscs.ch]
> Sent: 13 November 2017 14:59
> To: Hardware locality user list
> Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset
>
> Brice
>
> aha. thanks. I knew I'd seen a function for that, but couldn't remember what 
> it was.
>
> Cheers
>
> JB
> 
> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of Brice 
> Goglin [brice.gog...@inria.fr]
> Sent: 13 November 2017 14:57
> To: Hardware locality user list
> Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset
>
> Use get_area_memlocation()
>
> membind() returns where the pages are *allowed* to go (anywhere)
> memlocation() returns where the pages are actually allocated.
>
> Brice
>
>
>
>
> Le 13/11/2017 06:52, Biddiscombe, John A. a écrit :
>> Thank you to you both.
>>
>> I modified the allocator to allocate one large block using hwloc_alloc and 
>> then use one thread per numa domain to  touch each page according to the 
>> tiling pattern - unfortunately, I hadn't appreciated that now
>> hwloc_get_area_membind_nodeset
>> always returns the full machine numa mask - and not the numa domain that the 
>> page was touched by (I guess it only gives the expected answer when 
>> set_area_membind is used first)
>>
>> I had hoped to use a dynamic query of the pages (using the first one of a 
>> given tile) to schedule each task that operates on a given tile to run on 
>> the numa node that touched it.
>>
>> I can work around this by using a matrix offset calculation to get the numa 
>> node, but if there's a way of querying the page directly - then please let 
>> me know.
>>
>> Thanks
>>
>> JB
>> 
>> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of 
>> Samuel Thibault [samuel.thiba...@inria.fr]
>> Sent: 12 November 2017 10:48
>> To: Hardware locality user list
>> Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset
>>
>> Brice Goglin, on dim. 12 nov. 2017 05:19:37 +0100, wrote:
>>> That's likely what's happening. Each set_area() may be creating a new 
>>> "virtual
>>> memory area". The kernel tries to merge them with neighbors if they go to 
>>> the
>>> same NUMA node. Otherwise it creates a new VMA.
>> Mmmm, that sucks. Ideally we'd have a way to ask the kernel not to
>> strictly bind the memory, but just to allocate on a given memory
>> node, and just hope that the allocation will not go away (e.g. due to
>> swapping), which thus doesn't need a VMA to record the information. As
>> you describe below, first-touch achieves that but it's not necessarily
>> so convenient.
>>
>>> I can't find the exact limit but it's something like 64k so I guess
>>> you're exhausting that.
>> It's sysctl vm.max_map_count
>>
>>> Question 2 : Is there a better way of achieving the result I'm looking 
>>> for
>>> (such as a call to membind with a stride of some kind to say put N 
>>> pages in
>>> a row on each domain in alternation).
>>>
>>>
>>> Unfortunately, the interleave policy doesn't have a stride argument. It's 
>>> one
>>> page on node 0, one page on node 1, etc.
>>>
>>> The only idea I have is to use the first-touch policy: Make sure your buffer
>>> isn't in physical memory yet, and have a thread on node 0 read the "0" 
>>> pages,
>>> and another thread on node 1 read the "1" page.
>> Or "next-touch" if that was to ever get merged into mainline Linux :)
>>
>> Samuel
>> ___
>> hwloc-users mailing list
>> hwloc-users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/hwloc-users

Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

2017-11-13 Thread Biddiscombe, John A.
In the documentation for get_area_memlocation it says
"If HWLOC_MEMBIND_BYNODESET is specified, set is considered a nodeset. 
Otherwise it's a cpuset."

but it also says "Flags are currently unused."

so where should the BY_NODESET policy be used? Does it have to be used with the 
original alloc call?

thanks

JB


From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of 
Biddiscombe, John A. [biddi...@cscs.ch]
Sent: 13 November 2017 14:59
To: Hardware locality user list
Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

Brice

aha. thanks. I knew I'd seen a function for that, but couldn't remember what it 
was.

Cheers

JB

From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of Brice 
Goglin [brice.gog...@inria.fr]
Sent: 13 November 2017 14:57
To: Hardware locality user list
Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

Use get_area_memlocation()

membind() returns where the pages are *allowed* to go (anywhere)
memlocation() returns where the pages are actually allocated.

Brice




Le 13/11/2017 06:52, Biddiscombe, John A. a écrit :
> Thank you to you both.
>
> I modified the allocator to allocate one large block using hwloc_alloc and 
> then use one thread per numa domain to  touch each page according to the 
> tiling pattern - unfortunately, I hadn't appreciated that now
> hwloc_get_area_membind_nodeset
> always returns the full machine numa mask - and not the numa domain that the 
> page was touched by (I guess it only gives the expected answer when 
> set_area_membind is used first)
>
> I had hoped to use a dynamic query of the pages (using the first one of a 
> given tile) to schedule each task that operates on a given tile to run on the 
> numa node that touched it.
>
> I can work around this by using a matrix offset calculation to get the numa 
> node, but if there's a way of querying the page directly - then please let me 
> know.
>
> Thanks
>
> JB
> 
> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of 
> Samuel Thibault [samuel.thiba...@inria.fr]
> Sent: 12 November 2017 10:48
> To: Hardware locality user list
> Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset
>
> Brice Goglin, on dim. 12 nov. 2017 05:19:37 +0100, wrote:
>> That's likely what's happening. Each set_area() may be creating a new 
>> "virtual
>> memory area". The kernel tries to merge them with neighbors if they go to the
>> same NUMA node. Otherwise it creates a new VMA.
> Mmmm, that sucks. Ideally we'd have a way to ask the kernel not to
> strictly bind the memory, but just to allocate on a given memory
> node, and just hope that the allocation will not go away (e.g. due to
> swapping), which thus doesn't need a VMA to record the information. As
> you describe below, first-touch achieves that but it's not necessarily
> so convenient.
>
>> I can't find the exact limit but it's something like 64k so I guess
>> you're exhausting that.
> It's sysctl vm.max_map_count
>
>> Question 2 : Is there a better way of achieving the result I'm looking 
>> for
>> (such as a call to membind with a stride of some kind to say put N pages 
>> in
>> a row on each domain in alternation).
>>
>>
>> Unfortunately, the interleave policy doesn't have a stride argument. It's one
>> page on node 0, one page on node 1, etc.
>>
>> The only idea I have is to use the first-touch policy: Make sure your buffer
>> isn't in physical memory yet, and have a thread on node 0 read the "0" pages,
>> and another thread on node 1 read the "1" page.
> Or "next-touch" if that was to ever get merged into mainline Linux :)
>
> Samuel
> ___
> hwloc-users mailing list
> hwloc-users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-users
> ___
> hwloc-users mailing list
> hwloc-users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-users

___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users


Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

2017-11-13 Thread Biddiscombe, John A.
Brice

aha. thanks. I knew I'd seen a function for that, but couldn't remember what it 
was.

Cheers

JB

From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of Brice 
Goglin [brice.gog...@inria.fr]
Sent: 13 November 2017 14:57
To: Hardware locality user list
Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

Use get_area_memlocation()

membind() returns where the pages are *allowed* to go (anywhere)
memlocation() returns where the pages are actually allocated.

Brice




Le 13/11/2017 06:52, Biddiscombe, John A. a écrit :
> Thank you to you both.
>
> I modified the allocator to allocate one large block using hwloc_alloc and 
> then use one thread per numa domain to  touch each page according to the 
> tiling pattern - unfortunately, I hadn't appreciated that now
> hwloc_get_area_membind_nodeset
> always returns the full machine numa mask - and not the numa domain that the 
> page was touched by (I guess it only gives the expected answer when 
> set_area_membind is used first)
>
> I had hoped to use a dynamic query of the pages (using the first one of a 
> given tile) to schedule each task that operates on a given tile to run on the 
> numa node that touched it.
>
> I can work around this by using a matrix offset calculation to get the numa 
> node, but if there's a way of querying the page directly - then please let me 
> know.
>
> Thanks
>
> JB
> 
> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of 
> Samuel Thibault [samuel.thiba...@inria.fr]
> Sent: 12 November 2017 10:48
> To: Hardware locality user list
> Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset
>
> Brice Goglin, on dim. 12 nov. 2017 05:19:37 +0100, wrote:
>> That's likely what's happening. Each set_area() may be creating a new 
>> "virtual
>> memory area". The kernel tries to merge them with neighbors if they go to the
>> same NUMA node. Otherwise it creates a new VMA.
> Mmmm, that sucks. Ideally we'd have a way to ask the kernel not to
> strictly bind the memory, but just to allocate on a given memory
> node, and just hope that the allocation will not go away (e.g. due to
> swapping), which thus doesn't need a VMA to record the information. As
> you describe below, first-touch achieves that but it's not necessarily
> so convenient.
>
>> I can't find the exact limit but it's something like 64k so I guess
>> you're exhausting that.
> It's sysctl vm.max_map_count
>
>> Question 2 : Is there a better way of achieving the result I'm looking 
>> for
>> (such as a call to membind with a stride of some kind to say put N pages 
>> in
>> a row on each domain in alternation).
>>
>>
>> Unfortunately, the interleave policy doesn't have a stride argument. It's one
>> page on node 0, one page on node 1, etc.
>>
>> The only idea I have is to use the first-touch policy: Make sure your buffer
>> isn't in physical memory yet, and have a thread on node 0 read the "0" pages,
>> and another thread on node 1 read the "1" page.
> Or "next-touch" if that was to ever get merged into mainline Linux :)
>
> Samuel
> ___
> hwloc-users mailing list
> hwloc-users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-users
> ___
> hwloc-users mailing list
> hwloc-users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-users

___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users


Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

2017-11-13 Thread Brice Goglin
Use get_area_memlocation()

membind() returns where the pages are *allowed* to go (anywhere)
memlocation() returns where the pages are actually allocated.
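A small helper-function sketch of that difference (assuming hwloc 1.11.x with
the nodeset variants and an already-loaded topology): get_area_membind_nodeset()
reports the binding policy and allowed nodes, so an unbound first-touched buffer
typically reports the whole machine, while get_area_memlocation() reports the
nodes that physically hold the pages right now.

#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

static void compare_binding_and_location(hwloc_topology_t topo,
                                         const void *buf, size_t len)
{
    hwloc_nodeset_t allowed = hwloc_bitmap_alloc();  /* where pages may go */
    hwloc_nodeset_t actual  = hwloc_bitmap_alloc();  /* where pages are now */
    hwloc_membind_policy_t policy;
    char *s1, *s2;

    hwloc_get_area_membind_nodeset(topo, buf, len, allowed, &policy, 0);
    hwloc_get_area_memlocation(topo, buf, len, actual, HWLOC_MEMBIND_BYNODESET);

    hwloc_bitmap_asprintf(&s1, allowed);
    hwloc_bitmap_asprintf(&s2, actual);
    printf("allowed nodes: %s (policy %d)\n", s1, (int)policy);
    printf("actual nodes:  %s\n", s2);

    free(s1); free(s2);
    hwloc_bitmap_free(allowed);
    hwloc_bitmap_free(actual);
}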

Brice




Le 13/11/2017 06:52, Biddiscombe, John A. a écrit :
> Thank you to you both.
>
> I modified the allocator to allocate one large block using hwloc_alloc and 
> then use one thread per numa domain to  touch each page according to the 
> tiling pattern - unfortunately, I hadn't appreciated that now
> hwloc_get_area_membind_nodeset
> always returns the full machine numa mask - and not the numa domain that the 
> page was touched by (I guess it only gives the expected answer when 
> set_area_membind is used first)
>
> I had hoped to use a dynamic query of the pages (using the first one of a 
> given tile) to schedule each task that operates on a given tile to run on the 
> numa node that touched it.
>
> I can work around this by using a matrix offset calculation to get the numa 
> node, but if there's a way of querying the page directly - then please let me 
> know.
>
> Thanks
>
> JB 
> 
> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of 
> Samuel Thibault [samuel.thiba...@inria.fr]
> Sent: 12 November 2017 10:48
> To: Hardware locality user list
> Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset
>
> Brice Goglin, on dim. 12 nov. 2017 05:19:37 +0100, wrote:
>> That's likely what's happening. Each set_area() may be creating a new 
>> "virtual
>> memory area". The kernel tries to merge them with neighbors if they go to the
>> same NUMA node. Otherwise it creates a new VMA.
> Mmmm, that sucks. Ideally we'd have a way to ask the kernel not to
> strictly bind the memory, but just to allocate on a given memory
> node, and just hope that the allocation will not go away (e.g. due to
> swapping), which thus doesn't need a VMA to record the information. As
> you describe below, first-touch achieves that but it's not necessarily
> so convenient.
>
>> I can't find the exact limit but it's something like 64k so I guess
>> you're exhausting that.
> It's sysctl vm.max_map_count
>
>> Question 2 : Is there a better way of achieving the result I'm looking 
>> for
>> (such as a call to membind with a stride of some kind to say put N pages 
>> in
>> a row on each domain in alternation).
>>
>>
>> Unfortunately, the interleave policy doesn't have a stride argument. It's one
>> page on node 0, one page on node 1, etc.
>>
>> The only idea I have is to use the first-touch policy: Make sure your buffer
> isn't in physical memory yet, and have a thread on node 0 read the "0" pages,
>> and another thread on node 1 read the "1" page.
> Or "next-touch" if that was to ever get merged into mainline Linux :)
>
> Samuel
> ___
> hwloc-users mailing list
> hwloc-users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-users
> ___
> hwloc-users mailing list
> hwloc-users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-users

___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users


Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

2017-11-13 Thread Biddiscombe, John A.
Thank you to you both.

I modified the allocator to allocate one large block using hwloc_alloc and then 
use one thread per numa domain to  touch each page according to the tiling 
pattern - unfortunately, I hadn't appreciated that now
hwloc_get_area_membind_nodeset
always returns the full machine numa mask - and not the numa domain that the 
page was touched by (I guess it only gives the expected answer when 
set_area_membind is used first)

I had hoped to use a dynamic query of the pages (using the first one of a given 
tile) to schedule each task that operates on a given tile to run on the numa 
node that touched it.
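What that per-tile query could look like with the get_area_memlocation() call
suggested elsewhere in this thread; a sketch assuming hwloc >= 1.11.3 and an
already-loaded topology, with run_near_tile as a hypothetical helper name:

#include <hwloc.h>
#include <unistd.h>

/* Query which NUMA node physically holds the first page of a tile, then bind
   the calling thread to that node's CPUs before running the tile's task. */
static int run_near_tile(hwloc_topology_t topo, const void *tile_first_page)
{
    const size_t page = (size_t) sysconf(_SC_PAGESIZE);
    hwloc_bitmap_t nodeset = hwloc_bitmap_alloc();
    hwloc_bitmap_t cpuset  = hwloc_bitmap_alloc();
    int err = hwloc_get_area_memlocation(topo, tile_first_page, page, nodeset,
                                         HWLOC_MEMBIND_BYNODESET);
    if (!err && !hwloc_bitmap_iszero(nodeset)) {
        hwloc_cpuset_from_nodeset(topo, cpuset, nodeset);  /* node(s) -> CPUs */
        err = hwloc_set_cpubind(topo, cpuset, HWLOC_CPUBIND_THREAD);
    }
    hwloc_bitmap_free(cpuset);
    hwloc_bitmap_free(nodeset);
    return err;
}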

I can work around this by using a matrix offset calculation to get the numa 
node, but if there's a way of querying the page directly - then please let me 
know.

Thanks

JB 

From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of Samuel 
Thibault [samuel.thiba...@inria.fr]
Sent: 12 November 2017 10:48
To: Hardware locality user list
Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

Brice Goglin, on dim. 12 nov. 2017 05:19:37 +0100, wrote:
> That's likely what's happening. Each set_area() may be creating a new "virtual
> memory area". The kernel tries to merge them with neighbors if they go to the
> same NUMA node. Otherwise it creates a new VMA.

Mmmm, that sucks. Ideally we'd have a way to ask the kernel not to
strictly bind the memory, but just to allocate on a given memory
node, and just hope that the allocation will not go away (e.g. due to
swapping), which thus doesn't need a VMA to record the information. As
you describe below, first-touch achieves that but it's not necessarily
so convenient.

> I can't find the exact limit but it's something like 64k so I guess
> you're exhausting that.

It's sysctl vm.max_map_count

> Question 2 : Is there a better way of achieving the result I'm looking for
> (such as a call to membind with a stride of some kind to say put N pages 
> in
> a row on each domain in alternation).
>
>
> Unfortunately, the interleave policy doesn't have a stride argument. It's one
> page on node 0, one page on node 1, etc.
>
> The only idea I have is to use the first-touch policy: Make sure your buffer
> isn't in physical memory yet, and have a thread on node 0 read the "0" pages,
> and another thread on node 1 read the "1" page.

Or "next-touch" if that was to ever get merged into mainline Linux :)

Samuel
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users


Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

2017-11-12 Thread Samuel Thibault
Brice Goglin, on dim. 12 nov. 2017 05:19:37 +0100, wrote:
> That's likely what's happening. Each set_area() may be creating a new "virtual
> memory area". The kernel tries to merge them with neighbors if they go to the
> same NUMA node. Otherwise it creates a new VMA.

Mmmm, that sucks. Ideally we'd have a way to ask the kernel not to
strictly bind the memory, but just to allocate on a given memory
node, and just hope that the allocation will not go away (e.g. due to
swapping), which thus doesn't need a VMA to record the information. As
you describe below, first-touch achieves that but it's not necessarily
so convenient.

> I can't find the exact limit but it's something like 64k so I guess
> you're exhausting that.

It's sysctl vm.max_map_count

> Question 2 : Is there a better way of achieving the result I'm looking for
> (such as a call to membind with a stride of some kind to say put N pages 
> in
> a row on each domain in alternation).
> 
> 
> Unfortunately, the interleave policy doesn't have a stride argument. It's one
> page on node 0, one page on node 1, etc.
> 
> The only idea I have is to use the first-touch policy: Make sure your buffer
> isn't in physical memory yet, and have a thread on node 0 read the "0" pages,
> and another thread on node 1 read the "1" page.

Or "next-touch" if that was to ever get merged into mainline Linux :)

Samuel
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users


Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

2017-11-11 Thread Brice Goglin


Le 12/11/2017 00:14, Biddiscombe, John A. a écrit :
> I'm allocating some large matrices, from 10k squared elements up to
> 40k squared per node.
> I'm also using membind to place pages of the matrix memory across numa
> nodes so that the matrix might be bound according to the kind of
> pattern at the end of this email - where each 1 or 0 corresponds to a
> 256x256 block of memory.
>
> The way I'm doing this is by calling hwloc_set_area_membind_nodeset
> many thousands of times after allocation, and I've found that as the
> matrices get bigger, after some N calls to area_membind I
> get a failure and it returns -1 (errno does not seem to be set to
> either ENOSYS or EXDEV) - but strerror reports "Cannot allocate memory".
>
> Question 1 : by calling set_area_membind too many times, am I causing
> some resource in the memory tables to be exhausted?
>

Hello

That's likely what's happening. Each set_area() may be creating a new
"virtual memory area". The kernel tries to merge them with neighbors if
they go to the same NUMA node. Otherwise it creates a new VMA. I can't
find the exact limit but it's something like 64k so I guess you're
exhausting that.

> Question 2 : Is there a better way of achieving the result I'm looking
> for (such as a call to membind with a stride of some kind to say put N
> pages in a row on each domain in alternation).

Unfortunately, the interleave policy doesn't have a stride argument.
It's one page on node 0, one page on node 1, etc.

The only idea I have is to use the first-touch policy: Make sure your
buffer isn't in physical memory yet, and have a thread on node 0 read
the "0" pages, and another thread on node 1 read the "1" page.

Brice


>
> Many thanks
>
> JB
>
>
> [rows of the 0/1 tile pattern were not preserved in the archive]
> ... etc
>
>
>
>
> ___
> hwloc-users mailing list
> hwloc-users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-users

___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users