Re: [hwloc-users] Problems with binding memory

2022-03-01 Thread Brice Goglin


On 01/03/2022 at 17:34, Mike wrote:

Hello,

Usually you would rather allocate and bind at the same time so
that the memory doesn't need to be migrated when bound. However,
if you do not touch the memory after allocation, pages are not
actually physically allocated, hence there's nothing to migrate.
It might work, but keep this in mind.


I need all the data in one allocation, so that is why I opted to
allocate and then bind via the area function. The way I understand it
is that by using the memory binding policy HWLOC_MEMBIND_BIND with
hwloc_set_area_membind(), the pages will actually get allocated on the
specified cores. If that is not the case, I suppose the best solution
would be to just touch the allocated data with my threads.



set_area_membind() doesn't allocate pages, but it tells the operating 
system "whenever you allocate them, do it on that NUMA node". Anyway, 
what you're doing makes sense.
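A minimal sketch of that order of operations (illustrative only; it assumes
an initialized topology, <hwloc.h>/<string.h>, and a buffer length len):

    // the policy is recorded by set_area_membind(), but physical pages only
    // land on the chosen NUMA node once the range is actually touched
    hwloc_obj_t node = hwloc_get_obj_by_type(topology, HWLOC_OBJ_NUMANODE, 0);
    char *buf = (char *) hwloc_alloc(topology, len);   // virtual allocation only
    hwloc_set_area_membind(topology, buf, len, node->nodeset,
                           HWLOC_MEMBIND_BIND, HWLOC_MEMBIND_BYNODESET);
    memset(buf, 0, len);                               // first touch: pages now appear on node 0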





Can you print memory binding like below instead of printing only
the first PU in the set returned by get_area_membind?

    char *s;
    hwloc_bitmap_asprintf(&s, set);
    /* s is now a C string of the bitmap, use it in your std::cout line */

I tried that and now get_area_membind returns that all memory is bound 
to 0x,0x,,,0x,0x




Please run "lstopo -.synthetic" to compress the output a lot. I will be 
able to reuse it from here and understand your binding mask.
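For reference, the same synthetic description can also be produced from
inside a program (a rough sketch, assuming a loaded topology):

    char synth[1024];
    if (hwloc_topology_export_synthetic(topology, synth, sizeof(synth), 0) != -1)
      std::cout << "synthetic: " << synth << std::endl;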


Brice





Re: [OMPI users] Trouble compiling OpenMPI with Infiniband support

2022-03-01 Thread Joshua Ladd via users
These are very, very old versions of UCX and HCOLL installed in your
environment. Also, MXM was deprecated years ago in favor of UCX. What
version of MOFED is installed (run ofed_info -s)? What HCA generation is
present (run ibstat)?

Josh

On Tue, Mar 1, 2022 at 6:42 AM Angel de Vicente via users <
users@lists.open-mpi.org> wrote:

> Hello,
>
> John Hearns via users  writes:
>
> > Stupid answer from me. If latency/bandwidth numbers are bad then check
> > that you are really running over the interface that you think you
> > should be. You could be falling back to running over Ethernet.
>
> I'm quite out of my depth here, so all answers are helpful, as I might have
> skipped something very obvious.
>
> In order to try and avoid the possibility of falling back to running
> over Ethernet, I submitted the job with:
>
> mpirun -n 2 --mca btl ^tcp osu_latency
>
> which gives me the following error:
>
> ,
> | At least one pair of MPI processes are unable to reach each other for
> | MPI communications.  This means that no Open MPI device has indicated
> | that it can be used to communicate between these processes.  This is
> | an error; Open MPI requires that all MPI processes be able to reach
> | each other.  This error can sometimes be the result of forgetting to
> | specify the "self" BTL.
> |
> |   Process 1 ([[37380,1],1]) is on host: s01r1b20
> |   Process 2 ([[37380,1],0]) is on host: s01r1b19
> |   BTLs attempted: self
> |
> | Your MPI job is now going to abort; sorry.
> `
>
> This is certainly not happening when I use the "native" OpenMPI,
> etc. provided in the cluster. I have not knowingly specified anywhere
> not to support "self", so I have no clue what might be going on, as I
> assumed that "self" was always built for OpenMPI.
>
> Any hints on what (and where) I should look for?
>
> Many thanks,
> --
> Ángel de Vicente
>
> Tel.: +34 922 605 747
> Web.: http://research.iac.es/proyecto/polmag/
>


Re: [hwloc-users] Problems with binding memory

2022-03-01 Thread Mike
Hello,

> Usually you would rather allocate and bind at the same time so that the
> memory doesn't need to be migrated when bound. However, if you do not touch
> the memory after allocation, pages are not actually physically allocated,
> hence there's nothing to migrate. It might work, but keep this in mind.
>

I need all the data in one allocation, so that is why I opted to allocate
and then bind via the area function. The way I understand it is that by
using the memory binding policy HWLOC_MEMBIND_BIND with
hwloc_set_area_membind(), the pages will actually get allocated on the
specified cores. If that is not the case, I suppose the best solution would
be to just touch the allocated data with my threads.

> Can you print memory binding like below instead of printing only the first
> PU in the set returned by get_area_membind?
>
> char *s;
> hwloc_bitmap_asprintf(&s, set);
> /* s is now a C string of the bitmap, use it in your std::cout line */

I tried that and now get_area_membind returns that all memory is bound to
0x,0x,,,0x,0x

>
> People often do the contrary. They bind threads, and then they have
> threads allocate/touch memory so that buffers are physically allocated near
> the related threads (automatic by default). It works well when the number
> of threads is known in advance. You place one thread per core, they never
> move. As long as memory is big enough to store the data nearby, everybody's
> happy. If the number of threads varies at runtime, and/or if they need to
> move, things become more difficult.
>
> Your approach is also correct. In the end, it's rather a question of
> whether your code is data-centric or compute-centric, and whether
> imbalances may require moving things during execution. Moving threads
> is usually cheaper. But oversubscribing cores with multiple threads is
> usually a bad idea, which is likely why people place one thread per core
> first.
>
My code is rather data-bound, and my main motivation for binding the threads
is that I do not want hyperthreading on the cores and that I want to keep all
threads that operate on the same data within one L3 cache.
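A minimal sketch of that kind of placement (illustrative only; l3_index and
thread_id are hypothetical variables, and it assumes hwloc >= 2.0 where L3
caches are HWLOC_OBJ_L3CACHE objects):

    // bind the calling thread to one core of a given L3, ignoring SMT siblings
    hwloc_obj_t l3   = hwloc_get_obj_by_type(topology, HWLOC_OBJ_L3CACHE, l3_index);
    hwloc_obj_t core = hwloc_get_obj_inside_cpuset_by_type(topology, l3->cpuset,
                                                           HWLOC_OBJ_CORE, thread_id);
    hwloc_bitmap_t cpuset = hwloc_bitmap_dup(core->cpuset);
    hwloc_bitmap_singlify(cpuset);               // keep only the first PU of the core
    hwloc_set_cpubind(topology, cpuset, HWLOC_CPUBIND_THREAD);
    hwloc_bitmap_free(cpuset);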

> And send the output of lstopo on your machine so that I can understand it.
>
The machine has two sockets with 64 cores each. Cores 0-7 share one L3
cache, as do cores 8-15, and so on.
The output of lstopo is quite large, but if my description does not suffice,
I can send it.



Thanks for your time

Mike

On Tue, 1 March 2022 at 15:42, Brice Goglin <
brice.gog...@inria.fr> wrote:

>
> On 01/03/2022 at 15:17, Mike wrote:
>
> Dear list,
>
> I have a program that utilizes Open MPI + multithreading, and I want the
> freedom to decide on which hardware cores my threads should run. That
> already works with hwloc_set_cpubind(), so now I also want to bind memory
> to the hardware cores. But I just can't get it to work.
>
> Basically, I wrote the memory binding into my allocator, so the memory
> will be allocated and then bound.
>
>
> Hello
>
> Usually you would rather allocate and bind at the same time so that the
> memory doesn't need to be migrated when bound. However, if you do not touch
> the memory after allocation, pages are not actually physically allocated,
> hence there's nothing to migrate. It might work, but keep this in mind.
>
>
> I use hwloc 2.4.1, run the code on a Linux system and I did check with
> “hwloc-info --support” if hwloc_set_area_membind() and
> hwloc_get_area_membind() are supported and they are.
>
> Here is a snippet of my code, which runs through without any error. But
> the hwloc_get_area_membind() always returns that all memory is bound to PU
> 0, when I think it should be bound to different PUs. Am I missing something?
>
>
> Can you print memory binding like below instead of printing only the first
> PU in the set returned by get_area_membind?
>
> char *s;
> hwloc_bitmap_asprintf(&s, set);
> /* s is now a C string of the bitmap, use it in your std::cout line */
>
> And send the output of lstopo on your machine so that I can understand it.
>
> Or you could print the smallest object that contains the binding by
> calling hwloc_get_obj_covering_cpuset(topology, set). It returns an object
> whose type may be printed as a C-string with
> hwloc_obj_type_string(obj->type).
>
> You may also do the same before set_area_membind() if you want to verify
> that you're binding where you really want.
>
>
>
> T* allocate(size_t n, hwloc_topology_t topology, int rank)
> {
>   // allocate memory
>   T* t = (T*)hwloc_alloc(topology, sizeof(T) * n);
>   // elements per thread
>   size_t ept = 1024;
>   hwloc_bitmap_t set;
>   size_t offset = 0;
>   size_t threadcount = 4;
>
>   set = hwloc_bitmap_alloc();
>   if(!set) {
>     fprintf(stderr, "failed to allocate a bitmap\n");
>   }
>   // bind memory to every thread
>   for(size_t i = 0; i < threadcount; i++)
>   {
>     // logical index of where to bind the memory
>     auto logid = (i + rank * threadcount) * 2;
>     auto logobj = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PU, logid);

Re: [hwloc-users] Problems with binding memory

2022-03-01 Thread Brice Goglin


On 01/03/2022 at 15:17, Mike wrote:


Dear list,

I have a program that utilizes Open MPI + multithreading, and I want the
freedom to decide on which hardware cores my threads should run. That
already works with hwloc_set_cpubind(), so now I also want to bind memory
to the hardware cores. But I just can't get it to work.


Basically, I wrote the memory binding into my allocator, so the memory 
will be allocated and then bound.




Hello

Usually you would rather allocate and bind at the same time so that the 
memory doesn't need to be migrated when bound. However, if you do not 
touch the memory after allocation, pages are not actually physically 
allocated, hence there's nothing to migrate. It might work, but keep this in mind.
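For reference, allocating and binding in one step looks roughly like this (a
hedged sketch, not the code from this thread; len and the node index are
illustrative):

    // memory comes back already bound to the given nodeset, so nothing
    // has to be migrated later
    hwloc_obj_t node = hwloc_get_obj_by_type(topology, HWLOC_OBJ_NUMANODE, 0);
    void *buf = hwloc_alloc_membind(topology, len, node->nodeset,
                                    HWLOC_MEMBIND_BIND, HWLOC_MEMBIND_BYNODESET);
    /* ... use buf ... */
    hwloc_free(topology, buf, len);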



I use hwloc 2.4.1, run the code on a Linux system, and I checked with 
“hwloc-info --support” that hwloc_set_area_membind() and 
hwloc_get_area_membind() are supported, and they are.


Here is a snippet of my code, which runs through without any error. 
But the hwloc_get_area_membind() always returns that all memory is 
bound to PU 0, when I think it should be bound to different PUs. Am I 
missing something?




Can you print memory binding like below instead of printing only the 
first PU in the set returned by get_area_membind?


char *s;
hwloc_bitmap_asprintf(&s, set);
/* s is now a C string of the bitmap, use it in your std::cout line */
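A fuller sketch of that check, reusing the variables from your snippet (t,
offset, ept, T; illustrative only, error handling omitted):

    hwloc_bitmap_t got = hwloc_bitmap_alloc();
    hwloc_membind_policy_t policy;
    if (hwloc_get_area_membind(topology, t + offset, sizeof(T) * ept,
                               got, &policy, 0) == 0) {
      char *s;
      hwloc_bitmap_asprintf(&s, got);      /* full bitmap as a C string */
      std::cout << "area bound to " << s << " (policy " << policy << ")" << std::endl;
      free(s);
    }
    hwloc_bitmap_free(got);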

And send the output of lstopo on your machine so that I can understand it.

Or you could print the smallest object that contains the binding by 
calling hwloc_get_obj_covering_cpuset(topology, set). It returns an 
object whose type may be printed as a C-string with 
hwloc_obj_type_string(obj->type).
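For instance (a short sketch of that suggestion, reusing the same set):

    hwloc_obj_t obj = hwloc_get_obj_covering_cpuset(topology, set);
    if (obj)
      std::cout << "binding covered by a " << hwloc_obj_type_string(obj->type)
                << " (logical index " << obj->logical_index << ")" << std::endl;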


You may also do the same before set_area_membind() if you want to verify 
that you're binding where you really want.





T* allocate(size_t n, hwloc_topology_t topology, int rank)
{
  // allocate memory
  T* t = (T*)hwloc_alloc(topology, sizeof(T) * n);
  // elements per thread
  size_t ept = 1024;
  hwloc_bitmap_t set;
  size_t offset = 0;
  size_t threadcount = 4;

  set = hwloc_bitmap_alloc();
  if(!set) {
    fprintf(stderr, "failed to allocate a bitmap\n");
  }
  // bind memory to every thread
  for(size_t i = 0; i < threadcount; i++)
  {
    // logical index of where to bind the memory
    auto logid = (i + rank * threadcount) * 2;
    auto logobj = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PU, logid);
    hwloc_bitmap_only(set, logobj->os_index);
    // set the memory binding
    // I use HWLOC_MEMBIND_BIND as policy so I do not have to touch
    // the memory first to allocate it
    auto err = hwloc_set_area_membind(topology, t + offset, sizeof(T) * ept,
                                      set, HWLOC_MEMBIND_BIND,
                                      HWLOC_MEMBIND_STRICT | HWLOC_MEMBIND_THREAD);

    if(err < 0)
      std::cout << "Error: memory binding failed" << std::endl;
    std::cout << "Rank=" << rank << " Tid=" << i << " on PU logical index="

[hwloc-users] Problems with binding memory

2022-03-01 Thread Mike
Dear list,

I have a program that utilizes Open MPI + multithreading, and I want the
freedom to decide on which hardware cores my threads should run. That
already works with hwloc_set_cpubind(), so now I also want to bind memory
to the hardware cores. But I just can't get it to work.

Basically, I wrote the memory binding into my allocator, so the memory will
be allocated and then bound. I use hwloc 2.4.1, run the code on a Linux
system, and I checked with “hwloc-info --support” that
hwloc_set_area_membind() and hwloc_get_area_membind() are supported, and
they are.
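(The same check can also be done programmatically; a small sketch, assuming
the topology is already loaded:)

    const struct hwloc_topology_support *sup = hwloc_topology_get_support(topology);
    if (!sup->membind->set_area_membind || !sup->membind->get_area_membind)
      std::cerr << "area membind not supported on this system" << std::endl;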

Here is a snippet of my code, which runs through without any error. But the
hwloc_get_area_membind() always returns that all memory is bound to PU 0,
when I think it should be bound to different PUs. Am I missing something?

T* allocate(size_t n, hwloc_topology_t topology, int rank)
{
  // allocate memory
  T* t = (T*)hwloc_alloc(topology, sizeof(T) * n);
  // elements per thread
  size_t ept = 1024;
  hwloc_bitmap_t set;
  size_t offset = 0;
  size_t threadcount = 4;

  set = hwloc_bitmap_alloc();
  if(!set) {
    fprintf(stderr, "failed to allocate a bitmap\n");
  }
  // bind memory to every thread
  for(size_t i = 0; i < threadcount; i++)
  {
    // logical index of where to bind the memory
    auto logid = (i + rank * threadcount) * 2;
    auto logobj = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PU, logid);
    hwloc_bitmap_only(set, logobj->os_index);
    // set the memory binding
    // I use HWLOC_MEMBIND_BIND as policy so I do not have to touch the
    // memory first to allocate it
    auto err = hwloc_set_area_membind(topology, t + offset, sizeof(T) * ept,
                                      set, HWLOC_MEMBIND_BIND,
                                      HWLOC_MEMBIND_STRICT | HWLOC_MEMBIND_THREAD);
    if(err < 0)
      std::cout << "Error: memory binding failed" << std::endl;

Re: [OMPI users] Trouble compiling OpenMPI with Infiniband support

2022-03-01 Thread Angel de Vicente via users
Hello,

John Hearns via users  writes:

> Stupid answer from me. If latency/bandwidth numbers are bad then check
> that you are really running over the interface that you think you
> should be. You could be falling back to running over Ethernet.

I'm quite out of my depth here, so all answers are helpful, as I might have
skipped something very obvious.

In order to try and avoid the possibility of falling back to running
over Ethernet, I submitted the job with:

mpirun -n 2 --mca btl ^tcp osu_latency

which gives me the following error:

,
| At least one pair of MPI processes are unable to reach each other for
| MPI communications.  This means that no Open MPI device has indicated
| that it can be used to communicate between these processes.  This is
| an error; Open MPI requires that all MPI processes be able to reach
| each other.  This error can sometimes be the result of forgetting to
| specify the "self" BTL.
| 
|   Process 1 ([[37380,1],1]) is on host: s01r1b20
|   Process 2 ([[37380,1],0]) is on host: s01r1b19
|   BTLs attempted: self
| 
| Your MPI job is now going to abort; sorry.
`

This is certainly not happening when I use the "native" OpenMPI,
etc. provided in the cluster. I have not knowingly specified anywhere
not to support "self", so I have no clue what might be going on, as I
assumed that "self" was always built for OpenMPI.

Any hints on what (and where) I should look for?

Many thanks,
-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/


Re: [OMPI users] Trouble compiling OpenMPI with Infiniband support

2022-03-01 Thread John Hearns via users
Stupid answer from me. If latency/bandwidth numbers are bad then check that
you are really running over the interface that you think you should be. You
could be falling back to running over Ethernet.

On Mon, 28 Feb 2022 at 20:10, Angel de Vicente via users <
users@lists.open-mpi.org> wrote:

> Hello,
>
> "Jeff Squyres (jsquyres)"  writes:
>
> > I'd recommend against using Open MPI v3.1.0 -- it's quite old.  If you
> > have to use Open MPI v3.1.x, I'd at least suggest using v3.1.6, which
> > has all the rolled-up bug fixes on the v3.1.x series.
> >
> > That being said, Open MPI v4.1.2 is the most current.  Open MPI v4.1.2
> does
> > restrict which versions of UCX it uses because there are bugs in the
> older
> > versions of UCX.  I am not intimately familiar with UCX -- you'll need
> to ask
> > Nvidia for support there -- but I was under the impression that it's
> just a
> > user-level library, and you could certainly install your own copy of UCX
> to use
> > with your compilation of Open MPI.  I.e., you're not restricted to
> whatever UCX
> > is installed in the cluster system-default locations.
>
> I did follow your advice, so I compiled my own version of UCX (1.11.2)
> and Open MPI v4.1.1, but for some reason the latency/bandwidth numbers
> are really bad compared to the previous ones, so something is wrong, but
> I am not sure how to debug it.
>
> > I don't know why you're getting MXM-specific error messages; those don't
> appear
> > to be coming from Open MPI (especially since you configured Open MPI with
> > --without-mxm).  If you can upgrade to Open MPI v4.1.2 and the latest
> UCX, see
> > if you are still getting those MXM error messages.
>
> In this latest attempt, yes, the MXM error messages are still there.
>
> Cheers,
> --
> Ángel de Vicente
>
> Tel.: +34 922 605 747
> Web.: http://research.iac.es/proyecto/polmag/
>