Re: [OMPI users] hwloc: Topology became empty, aborting!

2023-08-01 Thread Brice Goglin via users
Hello. This is a hwloc issue; the right mailing list is hwloc-us...@lists.open-mpi.org (please update the CCed address if you reply to this message). Try building hwloc with --enable-debug to get a lot of debug messages from lstopo, or run "hwloc-gather-topology foo" and send the resulting foo.tar.gz

Re: [OMPI users] Regarding process binding on OS X with oversubscription

2022-03-17 Thread Brice Goglin via users
Hello. OS X doesn't support process binding at all; that's why hwloc and Open MPI don't support it there either. Brice. On 17/03/2022 at 20:23, Sajid Ali via users wrote: Hi Open MPI developers, when trying to run a program with process binding and oversubscription (on a GitHub Actions CI instance)
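
As a rough illustration of the point above (not code from the thread), here is a minimal C sketch, assuming a hwloc 2.x installation linked with -lhwloc, that queries hwloc's CPU-binding support flags and then asks for the current binding. On macOS the support flag is expected to be 0 and hwloc_get_cpubind() to fail, which is why neither hwloc nor Open MPI can bind processes there.

/* Minimal sketch: check whether the OS reports process-binding support,
 * and if so print the current binding mask. */
#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topology;
    const struct hwloc_topology_support *support;
    hwloc_bitmap_t set;
    char *str = NULL;

    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    /* Support flags tell us what the OS allows, independently of what we try. */
    support = hwloc_topology_get_support(topology);
    printf("OS can bind this process: %d\n", support->cpubind->set_thisproc_cpubind);

    set = hwloc_bitmap_alloc();
    if (hwloc_get_cpubind(topology, set, HWLOC_CPUBIND_PROCESS) == 0) {
        hwloc_bitmap_asprintf(&str, set);
        printf("current binding: %s\n", str);
        free(str);
    } else {
        printf("hwloc_get_cpubind() failed (binding not supported?)\n");
    }

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topology);
    return 0;
}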

Re: [OMPI users] hwloc error

2021-08-23 Thread Brice Goglin via users
Hello Dwaipayan, you seem to be running a very old hwloc (maybe embedded inside an old Open MPI release?). Can you install a more recent hwloc from https://www.open-mpi.org/projects/hwloc/, build it, and run its "lstopo" to check whether the error remains? If so, could you open an issue on
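
If building lstopo is inconvenient, the following sketch (not from the thread; it assumes a recent hwloc 2.x with headers installed) does in miniature what lstopo does internally: load the topology through libhwloc and report a few object counts. If hwloc's hardware detection is broken on the machine, this small program should show it just like lstopo would.

/* Minimal topology check, roughly what lstopo does at its core. */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topology;

    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    /* Counts of a few common object types; zero or surprising values
     * would point at a detection problem. */
    printf("packages: %d\n", hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_PACKAGE));
    printf("cores:    %d\n", hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE));
    printf("PUs:      %d\n", hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_PU));

    hwloc_topology_destroy(topology);
    return 0;
}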

Re: [OMPI users] mpirun on Kubuntu 20.4.1 hangs

2020-11-14 Thread Brice Goglin via users
Hello. The hwloc/X11 stuff is caused by Open MPI using a hwloc that was built with the GL backend enabled (in your case, because the libhwloc-plugins package is installed). That backend is used for querying the locality of X11 displays running on NVIDIA GPUs (using libxnvctrl). Does running

Re: [OMPI users] kernel trap - divide by zero

2020-04-29 Thread Brice Goglin via users
Hello. Both 1.10.1 and 1.11.10 are very old. Any chance you could try at least 1.11.13, or even 2.x, on these machines? Unfortunately I can't remember everything we changed in this code over the last 5 years. We are not aware of any issue on Intel Haswell, but it's not impossible that something is buggy in the

Re: [OMPI users] topology.c line 940?

2020-03-30 Thread Brice Goglin via users
Hello. This is hwloc (the hardware detection tool) complaining that something is wrong in your hardware or operating system. It won't prevent your MPI code from working; however, process binding may not be optimal. You may want to upgrade your operating system kernel and/or BIOS. If you want to

Re: [OMPI users] [External] Re: AMD EPYC 7281: does NOT support binding memory to the process location

2020-01-08 Thread Brice Goglin via users
On 08/01/2020 at 21:51, Prentice Bisbal via users wrote: > On 1/8/20 3:30 PM, Brice Goglin via users wrote: >> On 08/01/2020 at 21:20, Prentice Bisbal via users wrote: >>> We just added about a dozen nodes to our cluster, which have AMD EPYC 7281 processors

Re: [OMPI users] AMD EPYC 7281: does NOT support binding memory to the process location

2020-01-08 Thread Brice Goglin via users
On 08/01/2020 at 21:20, Prentice Bisbal via users wrote: > We just added about a dozen nodes to our cluster, which have AMD EPYC 7281 processors. When a particular user's jobs fall on one of these nodes, he gets these error messages:
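
For context, a minimal C sketch (not part of the thread; it assumes hwloc 2.x headers and -lhwloc) that prints hwloc's memory-binding support flags, which is roughly the information behind a "does not support binding memory to the process location" warning. If these flags come back 0 on the EPYC nodes, the OS or kernel configuration is what refuses memory binding.

/* Minimal sketch: report whether the OS allows memory binding. */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topology;
    const struct hwloc_topology_support *support;

    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    support = hwloc_topology_get_support(topology);
    printf("set_thisproc_membind: %d\n", support->membind->set_thisproc_membind);
    printf("bind_membind:         %d\n", support->membind->bind_membind);
    printf("firsttouch_membind:   %d\n", support->membind->firsttouch_membind);

    hwloc_topology_destroy(topology);
    return 0;
}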

Re: [OMPI users] Communicator Split Type NUMA Behavior

2019-11-27 Thread Brice Goglin via users
The attached patch (against 4.0.2) should fix it; I'll prepare a PR to fix this upstream. Brice. On 27/11/2019 at 00:41, Brice Goglin via users wrote: > It looks like NUMA is broken, while others such as SOCKET and L3CACHE work fine. A quick look at opal_hwloc_base_get_relative_locality()

Re: [OMPI users] Communicator Split Type NUMA Behavior

2019-11-26 Thread Brice Goglin via users
It looks like NUMA is broken, while others such as SOCKET and L3CACHE work fine. A quick look at opal_hwloc_base_get_relative_locality() and friends tells me that those functions were not properly updated for the hwloc 2.0 NUMA changes. I'll try to understand what's going on tomorrow. Rebuilding OMPI
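
To reproduce the behavior discussed in this thread, a small C test along these lines can be used (a sketch, not the reporter's program; it assumes Open MPI, whose mpi.h provides the OMPI_COMM_TYPE_NUMA extension alongside the standard MPI_COMM_TYPE_SHARED):

/* Compare the standard shared-memory split with Open MPI's NUMA split. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int world_rank, node_rank, numa_rank;
    MPI_Comm node_comm, numa_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Standard split: one communicator per shared-memory node. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Open MPI extension: one communicator per NUMA node. */
    MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_NUMA, 0,
                        MPI_INFO_NULL, &numa_comm);
    MPI_Comm_rank(numa_comm, &numa_rank);

    printf("world %d: node-local rank %d, numa-local rank %d\n",
           world_rank, node_rank, numa_rank);

    MPI_Comm_free(&numa_comm);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}

On a multi-socket or multi-NUMA-node machine, the numa-local ranks should restart from 0 within each NUMA node; if the NUMA split misbehaves as reported, they will not match what the node topology suggests.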

[OMPI users] disabling ucx over omnipath

2019-11-15 Thread Brice Goglin via users
Hello. We have a platform with an old MLX4 partition and another OPA partition, and we want a single OMPI installation that works on both kinds of nodes. When we enable UCX in OMPI for MLX4, UCX ends up being used on the OPA partition too, and the performance is poor (3 GB/s instead of 10). The problem