Re: [hwloc-users] Problems with binding memory

2022-03-04 Thread Mike
Hello,

Ah yes, I see. I made a very basic mistake. It slipped my mind that the
machine only has two NUMA nodes, and that memory binding only concerns
itself with NUMA nodes, since that is where non-uniform memory access comes
into play.
Thanks for your time again.

Mike

On Wed, Mar 2, 2022 at 12:38, Brice Goglin <
brice.gog...@inria.fr> wrote:

>
> On 2022-03-02 at 12:31, Mike wrote:
>
> Hello,
>
>> Can you display both mask before set_area_membind and after
>> get_area_membind and send the entire output of all processes and threads?
>> If you can prefix the line with the PID, it'd help a lot :)
>>
> What do you mean with output of all processes and threads?
> If I execute with 1 MPI rank and 4 threads on that rank I get the
> following masks (all allocation is done on one thread, so only one pid):
>
> pid=1799039
> mask before set_area_membind: 0x0001
> mask after get_area_membind: 0x,0x,,,0x,0x
> mask before set_area_membind: 0x0002
> mask after get_area_membind: 0x,0x,,,0x,0x
> mask before set_area_membind: 0x0004
> mask after get_area_membind: 0x,0x,,,0x,0x
> mask before set_area_membind: 0x0008
> mask after get_area_membind: 0x,0x,,,0x,0x
>
>
> Everything looks normal here. With a single rank and 4 threads, your 4
> threads go on the first 4 cores. All of them are inside the first NUMA
> node. It's normal that all memory goes there.
>
> Your code won't use any core of the second socket/NUMAnode unless you have
> more than 64 threads, so you need more than 16 MPI ranks. Ranks 17 and
> above will allocate memory on the second socket/NUMAnode.
>
> Brice
>
>
> ___
> hwloc-users mailing list
> hwloc-users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-users

Re: [hwloc-users] Problems with binding memory

2022-03-02 Thread Brice Goglin


On 2022-03-02 at 12:31, Mike wrote:

> Hello,
>
>> Can you display both mask before set_area_membind and after
>> get_area_membind and send the entire output of all processes and
>> threads? If you can prefix the line with the PID, it'd help a lot :)
>
> What do you mean with output of all processes and threads?
> If I execute with 1 MPI rank and 4 threads on that rank I get the
> following masks (all allocation is done on one thread, so only one pid):
>
> pid=1799039
> mask before set_area_membind: 0x0001
> mask after get_area_membind: 0x,0x,,,0x,0x
> mask before set_area_membind: 0x0002
> mask after get_area_membind: 0x,0x,,,0x,0x
> mask before set_area_membind: 0x0004
> mask after get_area_membind: 0x,0x,,,0x,0x
> mask before set_area_membind: 0x0008
> mask after get_area_membind: 0x,0x,,,0x,0x



Everything looks normal here. With a single rank and 4 threads, your 4 
threads go on the first 4 cores. All of them are inside the first NUMA 
node. It's normal that all memory goes there.


Your code won't use any core of the second socket/NUMAnode unless you 
have more than 64 threads, so you need more than 16 MPI ranks. Ranks 17 
and above will allocate memory on the second socket/NUMAnode.
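
As a rough illustration of the arithmetic behind this, the sketch below maps a (rank, thread) pair to the logical PU index used in the allocator from this thread, (i + rank * threadcount) * 2, and then to a NUMA node. The constants (64 cores per node, 2 PUs per core) are read from the lstopo summary posted earlier; the function names are made up for this example and are not part of hwloc. Ranks are counted from 0 here, so "rank 17" above corresponds to rank index 16.

```cpp
#include <cassert>
#include <cstddef>

// Assumed machine layout, taken from the lstopo summary in this thread:
// 2 packages (one NUMA node each), 64 cores per package, 2 PUs per core.
constexpr std::size_t kCoresPerNode = 64;
constexpr std::size_t kPusPerCore = 2;

// Logical PU index chosen in the allocator: every second logical PU,
// i.e. the first hardware thread of each core. (Hypothetical helper.)
inline std::size_t logical_pu(std::size_t rank, std::size_t thread,
                              std::size_t threadcount) {
  assert(thread < threadcount);
  return (thread + rank * threadcount) * kPusPerCore;
}

// NUMA node that a logical PU index falls into, under the layout assumed
// above (logical PUs 0..127 on node 0, 128..255 on node 1).
inline std::size_t numa_node_of(std::size_t logical_pu_index) {
  return logical_pu_index / (kCoresPerNode * kPusPerCore);
}
```

With threadcount = 4, numa_node_of(logical_pu(15, 3, 4)) is 0 and numa_node_of(logical_pu(16, 0, 4)) is 1, which matches the statement that only the 17th rank and later ones reach the second node.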


Brice





Re: [hwloc-users] Problems with binding memory

2022-03-02 Thread Mike
Hello,

> Can you display both mask before set_area_membind and after
> get_area_membind and send the entire output of all processes and threads?
> If you can prefix the line with the PID, it'd help a lot :)
>
What do you mean with output of all processes and threads?
If I execute with 1 MPI rank and 4 threads on that rank I get the following
masks (all allocation is done on one thread, so only one pid):

pid=1799039
mask before set_area_membind: 0x0001
mask after get_area_membind: 0x,0x,,,0x,0x
mask before set_area_membind: 0x0002
mask after get_area_membind: 0x,0x,,,0x,0x
mask before set_area_membind: 0x0004
mask after get_area_membind: 0x,0x,,,0x,0x
mask before set_area_membind: 0x0008
mask after get_area_membind: 0x,0x,,,0x,0x

 Mike

On Wed, Mar 2, 2022 at 11:58, Brice Goglin <
brice.gog...@inria.fr> wrote:

> On 2022-03-02 at 11:38, Mike wrote:
>
> Hello,
>
>> If you print the set that is built before calling set_area_membind, you
>> should only see 4 bits in there, right? (since threadcount=4 in your code)
>>
>> I'd say 0xf for rank0, 0xf0 for rank1, etc.
>>
>> set_area_membind() will translate that into a single NUMA node, before
>> asking the kernel to bind. Later get_area_membind translates the single NUMA
>> node back into a set that contains all PUs of the NUMA node.
>>
>> That said, I am not sure I understand what threadcount means in your
>> code. Are you calling the allocate function multiple times with many
>> different ranks? (MPI ranks?)
>>
> The allocator function is called once for every MPI rank and threadcount
> is the number of threads that run on one MPI rank.
> I build the set so that only 1 bit is set before calling set_area_membind,
> so that the memory can only be bound to the specified hardware core.
> Basically, I call set_area_membind once for every thread on a MPI rank.
> After the allocation I will call hwloc_set_cpubind with a set that has
> again 1 bit set, so that (if all works properly) I bound an area of memory
> and a software thread to one specific hardware core.
>
>
> Can you display both mask before set_area_membind and after
> get_area_membind and send the entire output of all processes and threads?
> If you can prefix the line with the PID, it'd help a lot :)
>
> Brice
>
>

Re: [hwloc-users] Problems with binding memory

2022-03-02 Thread Brice Goglin

On 2022-03-02 at 11:38, Mike wrote:

> Hello,
>
>> If you print the set that is built before calling
>> set_area_membind, you should only see 4 bits in there, right?
>> (since threadcount=4 in your code)
>>
>> I'd say 0xf for rank0, 0xf0 for rank1, etc.
>>
>> set_area_membind() will translate that into a single NUMA node,
>> before asking the kernel to bind. Later get_area_membind translates
>> the single NUMA node back into a set that contains all PUs of the
>> NUMA node.
>>
>> That said, I am not sure I understand what threadcount means in
>> your code. Are you calling the allocate function multiple times
>> with many different ranks? (MPI ranks?)
>
> The allocator function is called once for every MPI rank and
> threadcount is the number of threads that run on one MPI rank.
> I build the set so that only 1 bit is set before calling
> set_area_membind, so that the memory can only be bound to the
> specified hardware core. Basically, I call set_area_membind once for
> every thread on a MPI rank.
> After the allocation I will call hwloc_set_cpubind with a set that has
> again 1 bit set, so that (if all works properly) I bound an area of
> memory and a software thread to one specific hardware core.




Can you display both mask before set_area_membind and after 
get_area_membind and send the entire output of all processes and 
threads? If you can prefix the line with the PID, it'd help a lot :)


Brice





Re: [hwloc-users] Problems with binding memory

2022-03-02 Thread Mike
Hello,

> If you print the set that is built before calling set_area_membind, you
> should only see 4 bits in there, right? (since threadcount=4 in your code)
>
> I'd say 0xf for rank0, 0xf0 for rank1, etc.
>
> set_area_membind() will translate that into a single NUMA node, before
> asking the kernel to bind. Later get_area_membind translates the single NUMA
> node back into a set that contains all PUs of the NUMA node.
>
> That said, I am not sure I understand what threadcount means in your code.
> Are you calling the allocate function multiple times with many different
> ranks? (MPI ranks?)
>
The allocator function is called once for every MPI rank and threadcount is
the number of threads that run on one MPI rank.
I build the set so that only 1 bit is set before calling set_area_membind,
so that the memory can only be bound to the specified hardware core.
Basically, I call set_area_membind once for every thread on a MPI rank.
After the allocation I will call hwloc_set_cpubind with a set that has
again 1 bit set, so that (if all works properly) I bound an area of memory
and a software thread to one specific hardware core.

Mike


On Wed, Mar 2, 2022 at 10:50, Brice Goglin <
brice.gog...@inria.fr> wrote:

>
> On 2022-03-02 at 10:09, Mike wrote:
>
>> Ok then your mask 0x,0x,,,0x,0x
>> corresponds exactly to NUMA node 0 (socket 0). Object cpusets can be
>> displayed on the command-line with "lstopo --cpuset" or "hwloc-calc numa:0".
>>
>> This would be OK if you're only spawning threads to the first socket. Do
>> you see the same mask for threads on the other socket?
>>
> Yes, I do.
>
>
> If you print the set that is built before calling set_area_membind, you
> should only see 4 bits in there, right? (since threadcount=4 in your code)
>
> I'd say 0xf for rank0, 0xf0 for rank1, etc.
>
> set_area_membind() will translate that into a single NUMA node, before
> asking the kernel to bind. Later get_area_membind translates the single NUMA
> node back into a set that contains all PUs of the NUMA node.
>
> That said, I am not sure I understand what threadcount means in your code.
> Are you calling the allocate function multiple times with many different
> ranks? (MPI ranks?)
>
> Brice
>
>
>
>
> Mike
>
> On Wed, Mar 2, 2022 at 09:53, Brice Goglin <
> brice.gog...@inria.fr> wrote:
>
>> On 2022-03-02 at 09:39, Mike wrote:
>>
>> Hello,
>>
>> Please run "lstopo -.synthetic" to compress the output a lot. I will be
>>> able to reuse it from here and understand your binding mask.
>>>
>> Package:2 [NUMANode(memory=270369247232)] L3Cache:8(size=33554432)
>> L2Cache:8(size=524288) L1dCache:1(size=32768) L1iCache:1(size=32768) Core:1
>> PU:2(indexes=2*128:1*2)
>>
>>
>> Ok then your mask 0x,0x,,,0x,0x
>> corresponds exactly to NUMA node 0 (socket 0). Object cpusets can be
>> displayed on the command-line with "lstopo --cpuset" or "hwloc-calc numa:0".
>>
>> This would be OK if you're only spawning threads to the first socket. Do
>> you see the same mask for threads on the other socket?
>>
>> Brice
>>

Re: [hwloc-users] Problems with binding memory

2022-03-02 Thread Brice Goglin


On 2022-03-02 at 10:09, Mike wrote:

>> Ok then your mask 0x,0x,,,0x,0x
>> corresponds exactly to NUMA node 0 (socket 0). Object cpusets can
>> be displayed on the command-line with "lstopo --cpuset" or
>> "hwloc-calc numa:0".
>>
>> This would be OK if you're only spawning threads to the first
>> socket. Do you see the same mask for threads on the other socket?
>
> Yes, I do.



If you print the set that is built before calling set_area_membind, you 
should only see 4 bits in there, right? (since threadcount=4 in your code)


I'd say 0xf for rank0, 0xf0 for rank1, etc.

set_area_membind() will translate that into a single NUMA node, before 
asking the kernel to bind. Later get_area_membind translates the single 
NUMA node back into a set that contains all PUs of the NUMA node.
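
A toy model of that round trip, assuming the OS numbering suggested by the lstopo summary in this thread (core c has PUs with OS indexes c and c + 128, and cores 0-63 sit on NUMA node 0). It is not how hwloc implements the translation, and every name below is hypothetical; it only shows why a one-bit set comes back as the full cpuset of a node.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

constexpr std::size_t kPus = 256;  // 2 nodes x 64 cores x 2 PUs (assumed)
using CpuSet = std::bitset<kPus>;

// Assumed OS numbering: the two PUs of core c have OS indexes c and c + 128,
// and cores 0-63 belong to node 0, cores 64-127 to node 1.
inline std::size_t node_of_os_pu(std::size_t os_index) {
  assert(os_index < kPus);
  return (os_index % 128) / 64;
}

// Set of all PUs that are local to the given node.
inline CpuSet node_cpuset(std::size_t node) {
  CpuSet s;
  for (std::size_t pu = 0; pu < kPus; ++pu)
    if (node_of_os_pu(pu) == node) s.set(pu);
  return s;
}

// Model of "bind with a cpuset, then read the binding back": the request is
// reduced to the nodes it touches, and the answer is expanded to all PUs of
// those nodes.
inline CpuSet reported_binding(const CpuSet& requested) {
  CpuSet out;
  for (std::size_t pu = 0; pu < kPus; ++pu)
    if (requested.test(pu)) out |= node_cpuset(node_of_os_pu(pu));
  return out;
}
```

In this model, a request containing only PU 1 and a request containing only PU 3 both report the same 128-PU set, because both PUs belong to node 0.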


That said, I am not sure I understand what threadcount means in your 
code. Are you calling the allocate function multiple times with many 
different ranks? (MPI ranks?)


Brice





> Mike
>
> On Wed, Mar 2, 2022 at 09:53, Brice Goglin
> <brice.gog...@inria.fr> wrote:
>
>> On 2022-03-02 at 09:39, Mike wrote:
>>
>>> Hello,
>>>
>>>> Please run "lstopo -.synthetic" to compress the output a lot.
>>>> I will be able to reuse it from here and understand your
>>>> binding mask.
>>>
>>> Package:2 [NUMANode(memory=270369247232)]
>>> L3Cache:8(size=33554432) L2Cache:8(size=524288)
>>> L1dCache:1(size=32768) L1iCache:1(size=32768) Core:1
>>> PU:2(indexes=2*128:1*2)
>>
>> Ok then your mask 0x,0x,,,0x,0x
>> corresponds exactly to NUMA node 0 (socket 0). Object cpusets can
>> be displayed on the command-line with "lstopo --cpuset" or
>> "hwloc-calc numa:0".
>>
>> This would be OK if you're only spawning threads to the first
>> socket. Do you see the same mask for threads on the other socket?
>>
>> Brice



Re: [hwloc-users] Problems with binding memory

2022-03-02 Thread Mike
>
> Ok then your mask 0x,0x,,,0x,0x
> corresponds exactly to NUMA node 0 (socket 0). Object cpusets can be
> displayed on the command-line with "lstopo --cpuset" or "hwloc-calc numa:0".
>
> This would be OK if you're only spawning threads to the first socket. Do
> you see the same mask for threads on the other socket?
>
Yes, I do.

Mike

On Wed, Mar 2, 2022 at 09:53, Brice Goglin <
brice.gog...@inria.fr> wrote:

> On 2022-03-02 at 09:39, Mike wrote:
>
> Hello,
>
>> Please run "lstopo -.synthetic" to compress the output a lot. I will be
>> able to reuse it from here and understand your binding mask.
>>
> Package:2 [NUMANode(memory=270369247232)] L3Cache:8(size=33554432)
> L2Cache:8(size=524288) L1dCache:1(size=32768) L1iCache:1(size=32768) Core:1
> PU:2(indexes=2*128:1*2)
>
>
> Ok then your mask 0x,0x,,,0x,0x
> corresponds exactly to NUMA node 0 (socket 0). Object cpusets can be
> displayed on the command-line with "lstopo --cpuset" or "hwloc-calc numa:0".
>
> This would be OK if you're only spawning threads to the first socket. Do
> you see the same mask for threads on the other socket?
>
> Brice
>

Re: [hwloc-users] Problems with binding memory

2022-03-02 Thread Brice Goglin

On 2022-03-02 at 09:39, Mike wrote:

> Hello,
>
>> Please run "lstopo -.synthetic" to compress the output a lot. I
>> will be able to reuse it from here and understand your binding mask.
>
> Package:2 [NUMANode(memory=270369247232)] L3Cache:8(size=33554432)
> L2Cache:8(size=524288) L1dCache:1(size=32768) L1iCache:1(size=32768)
> Core:1 PU:2(indexes=2*128:1*2)




Ok then your mask 0x,0x,,,0x,0x 
corresponds exactly to NUMA node 0 (socket 0). Object cpusets can be 
displayed on the command-line with "lstopo --cpuset" or "hwloc-calc numa:0".


This would be OK if you're only spawning threads to the first socket. Do 
you see the same mask for threads on the other socket?


Brice





Re: [hwloc-users] Problems with binding memory

2022-03-02 Thread Mike
Hello,

> Please run "lstopo -.synthetic" to compress the output a lot. I will be
> able to reuse it from here and understand your binding mask.
>
Package:2 [NUMANode(memory=270369247232)] L3Cache:8(size=33554432)
L2Cache:8(size=524288) L1dCache:1(size=32768) L1iCache:1(size=32768) Core:1
PU:2(indexes=2*128:1*2)

Mike


On Tue, Mar 1, 2022 at 19:05, Brice Goglin <
brice.gog...@inria.fr> wrote:

>
> On 2022-03-01 at 17:34, Mike wrote:
>
> Hello,
>
>> Usually you would rather allocate and bind at the same time so that the
>> memory doesn't need to be migrated when bound. However, if you do not touch
>> the memory after allocation, pages are not actually physically allocated,
>> hence there's nothing to migrate. Might work but keep this in mind.
>>
>
> I need all the data in one allocation, so that is why I opted to allocate
> and then bind via the area function. The way I understand it is that by
> using the memory binding policy HWLOC_MEMBIND_BIND with
> hwloc_set_area_membind() the pages will actually get allocated on the
> specified cores. If that is not the case I suppose the best solution would
> be to just touch the allocated data with my threads.
>
>
> set_area_membind() doesn't allocate pages, but it tells the operating
> system "whenever you allocate them, do it on that NUMA node". Anyway, what
> you're doing makes sense.
>
>
>
>> Can you print memory binding like below instead of printing only the first
>> PU in the set returned by get_area_membind?
>>
>> char *s;
>> hwloc_bitmap_asprintf(&s, set);
>> /* s is now a C string of the bitmap, use it in your std::cout line */
>
> I tried that and now get_area_membind returns that all memory is bound to
> 0x,0x,,,0x,0x
>
>
> Please run "lstopo -.synthetic" to compress the output a lot. I will be
> able to reuse it from here and understand your binding mask.
> Brice
>
>

Re: [hwloc-users] Problems with binding memory

2022-03-01 Thread Brice Goglin


On 2022-03-01 at 17:34, Mike wrote:

> Hello,
>
>> Usually you would rather allocate and bind at the same time so
>> that the memory doesn't need to be migrated when bound. However,
>> if you do not touch the memory after allocation, pages are not
>> actually physically allocated, hence there's nothing to migrate.
>> Might work but keep this in mind.
>
> I need all the data in one allocation, so that is why I opted to
> allocate and then bind via the area function. The way I understand it
> is that by using the memory binding policy HWLOC_MEMBIND_BIND with
> hwloc_set_area_membind() the pages will actually get allocated on the
> specified cores. If that is not the case I suppose the best solution
> would be to just touch the allocated data with my threads.



set_area_membind() doesn't allocate pages, but it tells the operating 
system "whenever you allocate them, do it on that NUMA node". Anyway, 
what you're doing makes sense.





>> Can you print memory binding like below instead of printing only
>> the first PU in the set returned by get_area_membind?
>>
>>     char *s;
>>     hwloc_bitmap_asprintf(&s, set);
>>     /* s is now a C string of the bitmap, use it in your std::cout line */
>
> I tried that and now get_area_membind returns that all memory is bound
> to 0x,0x,,,0x,0x




Please run "lstopo -.synthetic" to compress the output a lot. I will be 
able to reuse it from here and understand your binding mask.


Brice





Re: [hwloc-users] Problems with binding memory

2022-03-01 Thread Mike
Hello,

> Usually you would rather allocate and bind at the same time so that the
> memory doesn't need to be migrated when bound. However, if you do not touch
> the memory after allocation, pages are not actually physically allocated,
> hence there's nothing to migrate. Might work but keep this in mind.
>

I need all the data in one allocation, so that is why I opted to allocate
and then bind via the area function. The way I understand it is that by
using the memory binding policy HWLOC_MEMBIND_BIND with
hwloc_set_area_membind() the pages will actually get allocated on the
specified cores. If that is not the case I suppose the best solution would
be to just touch the allocated data with my threads.
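
The "touch the data with my threads" fallback could look roughly like the sketch below. It is only an outline under assumptions: the function name and the element type are invented for the example, and the step where each thread binds itself to a core (for instance with hwloc_set_cpubind) is left out, so the code shows the partitioning and the first write but does not by itself control placement.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Allocates one buffer of threadcount * ept elements and lets every worker
// thread write ("touch") only its own slice of ept elements.
// Hypothetical helper, not part of hwloc.
inline std::unique_ptr<double[]> allocate_first_touch(std::size_t threadcount,
                                                      std::size_t ept) {
  assert(threadcount > 0 && ept > 0);
  // new double[n] without "()" does not write to the elements, so the
  // allocating thread is not the one that touches them first.
  std::unique_ptr<double[]> data(new double[threadcount * ept]);

  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < threadcount; ++i) {
    double* slice = data.get() + i * ept;
    workers.emplace_back([slice, ept, i] {
      // A real program would bind this thread to its core before the loop.
      for (std::size_t k = 0; k < ept; ++k)
        slice[k] = static_cast<double>(i);
    });
  }
  for (auto& w : workers) w.join();
  return data;
}
```

Calling allocate_first_touch(4, 1024) mirrors the threadcount and ept values from the allocator snippet in this thread.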

> Can you print memory binding like below instead of printing only the first
> PU in the set returned by get_area_membind?
>
> char *s;
> hwloc_bitmap_asprintf(&s, set);
> /* s is now a C string of the bitmap, use it in your std::cout line */

I tried that and now get_area_membind returns that all memory is bound to
0x,0x,,,0x,0x

>
> People often do the contrary. They bind threads, and then they have
> threads allocate/touch memory so that buffers are physically allocated near
> the related threads (automatic by default). It works well when the number
> of threads is known in advance. You place one thread per core, they never
> move. As long as memory is big enough to store the data nearby, everybody's
> happy. If the number of threads varies at runtime, and/or if they need to
> move, things become more difficult.
>
> Your approach is also correct. In the end, it's rather a question of
> whether your code is data-centric or compute-centric, and whether
> imbalances may require moving things during the execution. Moving threads
> is usually cheaper. But oversubscribing cores with multiple threads is
> usually a bad idea, that's likely why people place one thread per core
> first.
>
My code is rather data-bound and my main motivation for binding the threads
is because I did not want hyperthreading on cores and because I want to
keep all threads that operate on the same data in one L3 Cache.

> And send the output of lstopo on your machine so that I can understand it.
>
The machine has two sockets and on each socket are 64 cores. Cores 0-7
share one L3 cache, so do cores 8-15 and so on.
The output of lstopo is quite large, but if my description does not suffice
I can send it.
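
To make the description concrete, here is a small sketch of the L3 grouping, assuming cores are numbered consecutively and every 8 consecutive cores share one L3 cache, as described above. The helper names are hypothetical; the check simply tells whether the cores used by one rank (rank * threadcount up to rank * threadcount + threadcount - 1) stay inside a single L3 group.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCoresPerL3 = 8;  // "Cores 0-7 share one L3 cache"

// Index of the L3 cache that a core belongs to (assumed consecutive layout).
inline std::size_t l3_of_core(std::size_t core) { return core / kCoresPerL3; }

// True if all cores used by one rank fall under the same L3 cache.
inline bool rank_fits_one_l3(std::size_t rank, std::size_t threadcount) {
  assert(threadcount > 0);
  const std::size_t first = rank * threadcount;
  const std::size_t last = first + threadcount - 1;
  return l3_of_core(first) == l3_of_core(last);
}
```

With 4 threads per rank every rank stays inside one L3 group, since 4 divides 8; with 6 threads per rank, rank 1 would use cores 6-11 and straddle two groups.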



Thanks for your time

Mike

On Tue, Mar 1, 2022 at 15:42, Brice Goglin <
brice.gog...@inria.fr> wrote:

>
> On 2022-03-01 at 15:17, Mike wrote:
>
> Dear list,
>
> I have a program that utilizes Openmpi + multithreading and I want the
> freedom to decide on which hardware cores my threads should run. By using
> hwloc_set_cpubind() that already works, so now I also want to bind memory
> to the hardware cores. But I just can't get it to work.
>
> Basically, I wrote the memory binding into my allocator, so the memory
> will be allocated and then bound.
>
>
> Hello
>
> Usually you would rather allocate and bind at the same time so that the
> memory doesn't need to be migrated when bound. However, if you do not touch
> the memory after allocation, pages are not actually physically allocated,
> hence there's nothing to migrate. Might work but keep this in mind.
>
>
> I use hwloc 2.4.1, run the code on a Linux system and I did check with
> “hwloc-info --support” if hwloc_set_area_membind() and
> hwloc_get_area_membind() are supported and they are.
>
> Here is a snippet of my code, which runs through without any error. But
> the hwloc_get_area_membind() always returns that all memory is bound to PU
> 0, when I think it should be bound to different PUs. Am I missing something?
>
>
> Can you print memory binding like below instead of printing only the first
> PU in the set returned by get_area_membind?
>
> char *s;
> hwloc_bitmap_asprintf(&s, set);
> /* s is now a C string of the bitmap, use it in your std::cout line */
>
> And send the output of lstopo on your machine so that I can understand it.
>
> Or you could print the smallest object that contains the binding by
> calling hwloc_get_obj_covering_cpuset(topology, set). It returns an object
> whose type may be printed as a C-string with
> hwloc_obj_type_string(obj->type).
>
> You may also do the same before set_area_membind() if you want to verify
> that you're binding where you really want.
>
>
>
> T* allocate(size_t n, hwloc_topology_t topology, int rank)
> {
>   // allocate memory
>   T* t = (T*)hwloc_alloc(topology, sizeof(T) * n);
>   // elements perthread
>   size_t ept = 1024;
>   hwloc_bitmap_t set;
>   size_t offset = 0;
>   size_t threadcount= 4;
>
>   set = hwloc_bitmap_alloc();
>   if(!set) {
>     fprintf(stderr, "failed to allocate a bitmap\n");
>   }
>   // bind memory to every thread
>   for(size_t i = 0;i < threadcount; i++)
>   {
>     // logical index of where to bind the memory
>     auto logid = (i + rank * threadcount) * 2;
>     auto logobj = 

Re: [hwloc-users] Problems with binding memory

2022-03-01 Thread Brice Goglin


On 2022-03-01 at 15:17, Mike wrote:

> Dear list,
>
> I have a program that utilizes Openmpi + multithreading and I want the
> freedom to decide on which hardware cores my threads should run. By
> using hwloc_set_cpubind() that already works, so now I also want to
> bind memory to the hardware cores. But I just can't get it to work.
>
> Basically, I wrote the memory binding into my allocator, so the memory
> will be allocated and then bound.




Hello

Usually you would rather allocate and bind at the same time so that the 
memory doesn't need to be migrated when bound. However, if you do not 
touch the memory after allocation, pages are not actually physically 
allocated, hence there's nothing to migrate. Might work but keep this in mind.



> I use hwloc 2.4.1, run the code on a Linux system and I did check with
> “hwloc-info --support” if hwloc_set_area_membind() and
> hwloc_get_area_membind() are supported and they are.
>
> Here is a snippet of my code, which runs through without any error.
> But the hwloc_get_area_membind() always returns that all memory is
> bound to PU 0, when I think it should be bound to different PUs. Am I
> missing something?




Can you print memory binding like below instead of printing only the 
first PU in the set returned by get_area_membind?


char *s;
hwloc_bitmap_asprintf(&s, set);
/* s is now a C string of the bitmap, use it in your std::cout line */

And send the output of lstopo on your machine so that I can understand it.

Or you could print the smallest object that contains the binding by 
calling hwloc_get_obj_covering_cpuset(topology, set). It returns an 
object whose type may be printed as a C-string with 
hwloc_obj_type_string(obj->type).


You may also do the same before set_area_membind() if you want to verify 
that you're binding where you really want.





T* allocate(size_t n, hwloc_topology_t topology, int rank)
{
  // allocate memory
  T* t = (T*)hwloc_alloc(topology, sizeof(T) * n);
  // elements per thread
  size_t ept = 1024;
  hwloc_bitmap_t set;
  size_t offset = 0;
  size_t threadcount= 4;

  set = hwloc_bitmap_alloc();
  if(!set) {
    fprintf(stderr, "failed to allocate a bitmap\n");
  }
  // bind memory to every thread
  for(size_t i = 0;i < threadcount; i++)
  {
    // logical index of where to bind the memory
    auto logid = (i + rank * threadcount) * 2;
    auto logobj = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PU, logid);
    hwloc_bitmap_only(set, logobj->os_index);
    // set the memory binding
    // I use HWLOC_MEMBIND_BIND as policy so I do not have to touch
    // the memory first to allocate it
    auto err = hwloc_set_area_membind(topology, t + offset, sizeof(T) * ept,
                                      set, HWLOC_MEMBIND_BIND,
                                      HWLOC_MEMBIND_STRICT | HWLOC_MEMBIND_THREAD);

    if(err < 0)
      std::cout << "Error: memory binding failed" << std::endl;
    std::cout << "Rank=" << rank << " Tid=" << i << " on PU logical index="