Re: [hwloc-users] Problems with binding memory
On 01/03/2022 at 17:34, Mike wrote:

> Hello,
>
>> Usually you would rather allocate and bind at the same time so that the memory doesn't need to be migrated when bound. However, if you do not touch the memory after allocation, pages are not actually physically allocated, hence there's nothing to migrate. Might work but keep this in mind.
>
> I need all the data in one allocation, so that is why I opted to allocate and then bind via the area function. The way I understand it is that by using the memory binding policy HWLOC_MEMBIND_BIND with hwloc_set_area_membind() the pages will actually get allocated on the specified cores. If that is not the case I suppose the best solution would be to just touch the allocated data with my threads.

set_area_membind() doesn't allocate pages, but it tells the operating system "whenever you allocate them, do it on that NUMA node". Anyway, what you're doing makes sense.

>> Can you print memory binding like below instead of printing only the first PU in the set returned by get_area_membind?
>>
>> char *s;
>> hwloc_bitmap_asprintf(&s, set);
>> /* s is now a C string of the bitmap, use it in your std::cout line */
>
> I tried that and now get_area_membind returns that all memory is bound to 0x,0x,,,0x,0x

Please run "lstopo -.synthetic" to compress the output a lot. I will be able to reuse it from here and understand your binding mask.

Brice

___ hwloc-users mailing list hwloc-users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/hwloc-users
Re: [OMPI users] Trouble compiling OpenMPI with Infiniband support
These are very, very old versions of UCX and HCOLL installed in your environment. Also, MXM was deprecated years ago in favor of UCX. What version of MOFED is installed (run ofed_info -s)? What HCA generation is present (run ibstat). Josh On Tue, Mar 1, 2022 at 6:42 AM Angel de Vicente via users < users@lists.open-mpi.org> wrote: > Hello, > > John Hearns via users writes: > > > Stupid answer from me. If latency/bandwidth numbers are bad then check > > that you are really running over the interface that you think you > > should be. You could be falling back to running over Ethernet. > > I'm quite out of my depth here, so all answers are helpful, as I might have > skipped something very obvious. > > In order to try and avoid the possibility of falling back to running > over Ethernet, I submitted the job with: > > mpirun -n 2 --mca btl ^tcp osu_latency > > which gives me the following error: > > , > | At least one pair of MPI processes are unable to reach each other for > | MPI communications. This means that no Open MPI device has indicated > | that it can be used to communicate between these processes. This is > | an error; Open MPI requires that all MPI processes be able to reach > | each other. This error can sometimes be the result of forgetting to > | specify the "self" BTL. > | > | Process 1 ([[37380,1],1]) is on host: s01r1b20 > | Process 2 ([[37380,1],0]) is on host: s01r1b19 > | BTLs attempted: self > | > | Your MPI job is now going to abort; sorry. > ` > > This is certainly not happening when I use the "native" OpenMPI, > etc. provided in the cluster. I have not knowingly specified anywhere > not to support "self", so I have no clue what might be going on, as I > assumed that "self" was always built for OpenMPI. > > Any hints on what (and where) I should look for? 
> > Many thanks, > -- > Ángel de Vicente > > Tel.: +34 922 605 747 > Web.: http://research.iac.es/proyecto/polmag/ > > - > AVISO LEGAL: Este mensaje puede contener información confidencial y/o > privilegiada. Si usted no es el destinatario final del mismo o lo ha > recibido por error, por favor notifíquelo al remitente inmediatamente. > Cualquier uso no autorizadas del contenido de este mensaje está > estrictamente prohibida. Más información en: > https://www.iac.es/es/responsabilidad-legal > DISCLAIMER: This message may contain confidential and / or privileged > information. If you are not the final recipient or have received it in > error, please notify the sender immediately. Any unauthorized use of the > content of this message is strictly prohibited. More information: > https://www.iac.es/en/disclaimer >
Re: [hwloc-users] Problems with binding memory
Hello,

> Usually you would rather allocate and bind at the same time so that the memory doesn't need to be migrated when bound. However, if you do not touch the memory after allocation, pages are not actually physically allocated, hence there's nothing to migrate. Might work but keep this in mind.

I need all the data in one allocation, so that is why I opted to allocate and then bind via the area function. The way I understand it is that by using the memory binding policy HWLOC_MEMBIND_BIND with hwloc_set_area_membind() the pages will actually get allocated on the specified cores. If that is not the case I suppose the best solution would be to just touch the allocated data with my threads.

> Can you print memory binding like below instead of printing only the first PU in the set returned by get_area_membind?
>
> char *s;
> hwloc_bitmap_asprintf(&s, set);
> /* s is now a C string of the bitmap, use it in your std::cout line */

I tried that and now get_area_membind returns that all memory is bound to 0x,0x,,,0x,0x

> People often do the contrary. They bind threads, and then they have threads allocate/touch memory so that buffers are physically allocated near the related threads (automatic by default). It works well when the number of threads is known in advance. You place one thread per core, they never move. As long as memory is big enough to store the data nearby, everybody's happy. If the number of threads varies at runtime, and/or if they need to move, things become more difficult.
>
> Your approach is also correct. In the end, it's rather a question of whether your code is data-centric or compute-centric, and whether imbalances may require moving things during the execution. Moving threads is usually cheaper. But oversubscribing cores with multiple threads is usually a bad idea, which is likely why people place one thread per core first.
My code is rather data-bound, and my main motivation for binding the threads is that I did not want hyperthreading on cores and I want to keep all threads that operate on the same data in one L3 cache.

> And send the output of lstopo on your machine so that I can understand it.

The machine has two sockets with 64 cores each. Cores 0-7 share one L3 cache, as do cores 8-15 and so on. The output of lstopo is quite large, but if my description does not suffice I can send it.

Thanks for your time
Mike

On Tue, 1 Mar 2022 at 15:42, Brice Goglin <brice.gog...@inria.fr> wrote:

> On 01/03/2022 at 15:17, Mike wrote:
>
> > Dear list,
> >
> > I have a program that utilizes Open MPI + multithreading and I want the freedom to decide on which hardware cores my threads should run. By using hwloc_set_cpubind() that already works, so now I also want to bind memory to the hardware cores. But I just can't get it to work. Basically, I wrote the memory binding into my allocator, so the memory will be allocated and then bound.
>
> Hello
>
> Usually you would rather allocate and bind at the same time so that the memory doesn't need to be migrated when bound. However, if you do not touch the memory after allocation, pages are not actually physically allocated, hence there's nothing to migrate. Might work but keep this in mind.
>
> > I use hwloc 2.4.1, run the code on a Linux system and I did check with "hwloc-info --support" if hwloc_set_area_membind() and hwloc_get_area_membind() are supported and they are. Here is a snippet of my code, which runs through without any error. But the hwloc_get_area_membind() always returns that all memory is bound to PU 0, when I think it should be bound to different PUs. Am I missing something?
>
> Can you print memory binding like below instead of printing only the first PU in the set returned by get_area_membind?
> char *s;
> hwloc_bitmap_asprintf(&s, set);
> /* s is now a C string of the bitmap, use it in your std::cout line */
>
> And send the output of lstopo on your machine so that I can understand it.
>
> Or you could print the smallest object that contains the binding by calling hwloc_get_obj_covering_cpuset(topology, set). It returns an object whose type may be printed as a C-string with hwloc_obj_type_string(obj->type).
>
> You may also do the same before set_area_membind() if you want to verify that you're binding where you really want.
>
> > T* allocate(size_t n, hwloc_topology_t topology, int rank)
> > {
> >     // allocate memory
> >     T* t = (T*)hwloc_alloc(topology, sizeof(T) * n);
> >     // elements per thread
> >     size_t ept = 1024;
> >     hwloc_bitmap_t set;
> >     size_t offset = 0;
> >     size_t threadcount = 4;
> >
> >     set = hwloc_bitmap_alloc();
> >     if (!set) {
> >         fprintf(stderr, "failed to allocate a bitmap\n");
> >     }
> >     // bind memory to every thread
> >     for (size_t i = 0; i < threadcount; i++)
> >     {
> >         // logical index of where to bind the memory
> >         auto logid = (i + rank * threadcount) * 2;
> >         auto logobj =
Re: [hwloc-users] Problems with binding memory
On 01/03/2022 at 15:17, Mike wrote:

> Dear list,
>
> I have a program that utilizes Open MPI + multithreading and I want the freedom to decide on which hardware cores my threads should run. By using hwloc_set_cpubind() that already works, so now I also want to bind memory to the hardware cores. But I just can't get it to work. Basically, I wrote the memory binding into my allocator, so the memory will be allocated and then bound.

Hello

Usually you would rather allocate and bind at the same time so that the memory doesn't need to be migrated when bound. However, if you do not touch the memory after allocation, pages are not actually physically allocated, hence there's nothing to migrate. Might work but keep this in mind.

> I use hwloc 2.4.1, run the code on a Linux system and I did check with "hwloc-info --support" if hwloc_set_area_membind() and hwloc_get_area_membind() are supported and they are.
>
> Here is a snippet of my code, which runs through without any error. But the hwloc_get_area_membind() always returns that all memory is bound to PU 0, when I think it should be bound to different PUs. Am I missing something?

Can you print memory binding like below instead of printing only the first PU in the set returned by get_area_membind?

char *s;
hwloc_bitmap_asprintf(&s, set);
/* s is now a C string of the bitmap, use it in your std::cout line */

And send the output of lstopo on your machine so that I can understand it.

Or you could print the smallest object that contains the binding by calling hwloc_get_obj_covering_cpuset(topology, set). It returns an object whose type may be printed as a C-string with hwloc_obj_type_string(obj->type).

You may also do the same before set_area_membind() if you want to verify that you're binding where you really want.
T* allocate(size_t n, hwloc_topology_t topology, int rank)
{
    // allocate memory
    T* t = (T*)hwloc_alloc(topology, sizeof(T) * n);
    // elements per thread
    size_t ept = 1024;
    hwloc_bitmap_t set;
    size_t offset = 0;
    size_t threadcount = 4;

    set = hwloc_bitmap_alloc();
    if (!set) {
        fprintf(stderr, "failed to allocate a bitmap\n");
    }
    // bind memory to every thread
    for (size_t i = 0; i < threadcount; i++)
    {
        // logical index of where to bind the memory
        auto logid = (i + rank * threadcount) * 2;
        auto logobj = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PU, logid);
        hwloc_bitmap_only(set, logobj->os_index);
        // set the memory binding
        // I use HWLOC_MEMBIND_BIND as policy so I do not have to touch the
        // memory first to allocate it
        auto err = hwloc_set_area_membind(topology, t + offset, sizeof(T) * ept, set,
                                          HWLOC_MEMBIND_BIND,
                                          HWLOC_MEMBIND_STRICT | HWLOC_MEMBIND_THREAD);
        if (err < 0)
            std::cout << "Error: memory binding failed" << std::endl;
        std::cout << "Rank=" << rank << " Tid=" << i << " on PU logical index="
[hwloc-users] Problems with binding memory
Dear list,

I have a program that utilizes Open MPI + multithreading and I want the freedom to decide on which hardware cores my threads should run. By using hwloc_set_cpubind() that already works, so now I also want to bind memory to the hardware cores. But I just can't get it to work. Basically, I wrote the memory binding into my allocator, so the memory will be allocated and then bound.

I use hwloc 2.4.1, run the code on a Linux system and I did check with "hwloc-info --support" if hwloc_set_area_membind() and hwloc_get_area_membind() are supported and they are.

Here is a snippet of my code, which runs through without any error. But the hwloc_get_area_membind() always returns that all memory is bound to PU 0, when I think it should be bound to different PUs. Am I missing something?

T* allocate(size_t n, hwloc_topology_t topology, int rank)
{
    // allocate memory
    T* t = (T*)hwloc_alloc(topology, sizeof(T) * n);
    // elements per thread
    size_t ept = 1024;
    hwloc_bitmap_t set;
    size_t offset = 0;
    size_t threadcount = 4;

    set = hwloc_bitmap_alloc();
    if (!set) {
        fprintf(stderr, "failed to allocate a bitmap\n");
    }
    // bind memory to every thread
    for (size_t i = 0; i < threadcount; i++)
    {
        // logical index of where to bind the memory
        auto logid = (i + rank * threadcount) * 2;
        auto logobj = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PU, logid);
        hwloc_bitmap_only(set, logobj->os_index);
        // set the memory binding
        // I use HWLOC_MEMBIND_BIND as policy so I do not have to touch the
        // memory first to allocate it
        auto err = hwloc_set_area_membind(topology, t + offset, sizeof(T) * ept, set,
                                          HWLOC_MEMBIND_BIND,
                                          HWLOC_MEMBIND_STRICT | HWLOC_MEMBIND_THREAD);
        if (err < 0)
            std::cout << "Error: memory binding failed" << std::endl;
        std::cout << "Rank=" << rank << " Tid=" << i << " on PU logical index="
Re: [OMPI users] Trouble compiling OpenMPI with Infiniband support
Hello,

John Hearns via users writes:

> Stupid answer from me. If latency/bandwidth numbers are bad then check
> that you are really running over the interface that you think you
> should be. You could be falling back to running over Ethernet.

I'm quite out of my depth here, so all answers are helpful, as I might have skipped something very obvious.

In order to try and avoid the possibility of falling back to running over Ethernet, I submitted the job with:

mpirun -n 2 --mca btl ^tcp osu_latency

which gives me the following error:

,
| At least one pair of MPI processes are unable to reach each other for
| MPI communications. This means that no Open MPI device has indicated
| that it can be used to communicate between these processes. This is
| an error; Open MPI requires that all MPI processes be able to reach
| each other. This error can sometimes be the result of forgetting to
| specify the "self" BTL.
|
| Process 1 ([[37380,1],1]) is on host: s01r1b20
| Process 2 ([[37380,1],0]) is on host: s01r1b19
| BTLs attempted: self
|
| Your MPI job is now going to abort; sorry.
`

This is certainly not happening when I use the "native" OpenMPI, etc. provided in the cluster. I have not knowingly specified anywhere not to support "self", so I have no clue what might be going on, as I assumed that "self" was always built for OpenMPI.

Any hints on what (and where) I should look for?

Many thanks,
--
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/
Re: [OMPI users] Trouble compiling OpenMPI with Infiniband support
Stupid answer from me. If latency/bandwidth numbers are bad then check that you are really running over the interface that you think you should be. You could be falling back to running over Ethernet.

On Mon, 28 Feb 2022 at 20:10, Angel de Vicente via users < users@lists.open-mpi.org> wrote:

> Hello,
>
> "Jeff Squyres (jsquyres)" writes:
>
> > I'd recommend against using Open MPI v3.1.0 -- it's quite old. If you
> > have to use Open MPI v3.1.x, I'd at least suggest using v3.1.6, which
> > has all the rolled-up bug fixes on the v3.1.x series.
> >
> > That being said, Open MPI v4.1.2 is the most current. Open MPI v4.1.2 does
> > restrict which versions of UCX it uses because there are bugs in the older
> > versions of UCX. I am not intimately familiar with UCX -- you'll need to ask
> > Nvidia for support there -- but I was under the impression that it's just a
> > user-level library, and you could certainly install your own copy of UCX to use
> > with your compilation of Open MPI. I.e., you're not restricted to whatever UCX
> > is installed in the cluster system-default locations.
>
> I did follow your advice, so I compiled my own version of UCX (1.11.2)
> and OpenMPI v4.1.1, but for some reason the latency / bandwidth numbers
> are really bad compared to the previous ones, so something is wrong, but
> not sure how to debug it.
>
> > I don't know why you're getting MXM-specific error messages; those don't appear
> > to be coming from Open MPI (especially since you configured Open MPI with
> > --without-mxm). If you can upgrade to Open MPI v4.1.2 and the latest UCX, see
> > if you are still getting those MXM error messages.
>
> In this latest attempt, yes, the MXM error messages are still there.
>
> Cheers,
> --
> Ángel de Vicente
>
> Tel.: +34 922 605 747
> Web.: http://research.iac.es/proyecto/polmag/