Re: [OMPI users] hwloc: Topology became empty, aborting!

2023-08-01 Thread Brice Goglin via users
Hello This is a hwloc issue, the mailing list is hwloc-us...@lists.open-mpi.org (please update the CCed address if you reply to this message). Try building with --enable-debug to get a lot of debug messages in lstopo. Or run "hwloc-gather-topology foo" and send the resulting foo.tar.gz (you

Re: [OMPI users] Regarding process binding on OS X with oversubscription

2022-03-17 Thread Brice Goglin via users
Hello OS X doesn't support binding at all, that's why hwloc and OpenMPI don't support it either. Brice Le 17/03/2022 à 20:23, Sajid Ali via users a écrit : Hi OpenMPI-developers, When trying to run a program with process binding and oversubscription (on a github actions CI instance) with

Re: [OMPI users] hwloc error

2021-08-23 Thread Brice Goglin via users
Hello Dwaipayan You seem to be running a very old hwloc (maybe embedded inside an old Open MPI release?). Can you install a more recent hwloc from https://www.open-mpi.org/projects/hwloc/, build it, and run its "lstopo" to check whether the error remains? If so, could you open an issue on th

Re: [OMPI users] mpirun on Kubuntu 20.4.1 hangs

2020-11-14 Thread Brice Goglin via users
Hello The hwloc/X11 stuff is caused by OpenMPI using a hwloc that was built with the GL backend enabled (in your case, it's because package libhwloc-plugins is installed). That backend is used for querying the locality of X11 displays running on NVIDIA GPUs (using libxnvctrl). Does running "lstopo

Re: [OMPI users] kernel trap - divide by zero

2020-04-29 Thread Brice Goglin via users
Hello Both 1.10.1 and 1.11.10 are very old. Any chance you could try at least 1.11.13 or even 2.x on these machines? I can't remember everything we changed in this code 5 years later, unfortunately. We are not aware of any issue on Intel Haswell, but it's not impossible that something is buggy in the hardwar

Re: [OMPI users] topology.c line 940?

2020-03-29 Thread Brice Goglin via users
Hello This is hwloc (the hardware detection tool) complaining that something is wrong in your hardware or operating system. It won't prevent your MPI code from working, however process binding may not be optimal. You may want to upgrade your operating system kernel and/or BIOS. If you want to deb

Re: [OMPI users] [External] Re: AMD EPYC 7281: does NOT, support binding memory to the process location

2020-01-08 Thread Brice Goglin via users
Le 08/01/2020 à 21:51, Prentice Bisbal via users a écrit : > > On 1/8/20 3:30 PM, Brice Goglin via users wrote: >> Le 08/01/2020 à 21:20, Prentice Bisbal via users a écrit : >>> We just added about a dozen nodes to our cluster, which have AMD EPYC >>> 7281 processors

Re: [OMPI users] AMD EPYC 7281: does NOT, support binding memory to the process location

2020-01-08 Thread Brice Goglin via users
Le 08/01/2020 à 21:20, Prentice Bisbal via users a écrit : > We just added about a dozen nodes to our cluster, which have AMD EPYC > 7281 processors. When a particular users jobs fall on one of these > nodes, he gets these error messages: > >

Re: [OMPI users] Communicator Split Type NUMA Behavior

2019-11-27 Thread Brice Goglin via users
The attached patch (against 4.0.2) should fix it, I'll prepare a PR to fix this upstream. Brice Le 27/11/2019 à 00:41, Brice Goglin via users a écrit : > It looks like NUMA is broken, while others such as SOCKET and L3CACHE > work fine. A quick look in opal_hwloc_base_get_relati

Re: [OMPI users] Communicator Split Type NUMA Behavior

2019-11-26 Thread Brice Goglin via users
It looks like NUMA is broken, while others such as SOCKET and L3CACHE work fine. A quick look in opal_hwloc_base_get_relative_locality() and friends tells me that those functions were not properly updated to hwloc 2.0 NUMA changes. I'll try to understand what's going on tomorrow. Rebuilding OMPI w
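For illustration, a minimal C sketch (assuming hwloc >= 2.0; not the actual OMPI fix) of one way to check relative locality that keeps working after hwloc 2.0 moved NUMA nodes out of the main tree: iterate the NUMANODE objects and intersect their cpusets with the two process cpusets.

    /* Sketch: do two cpusets share a NUMA node? Iterating NUMANODE objects
     * explicitly works in hwloc 2.x, where NUMA nodes are memory children
     * rather than a regular level of the object tree. */
    #include <hwloc.h>
    #include <stdio.h>

    static int share_numa(hwloc_topology_t topo,
                          hwloc_const_cpuset_t a, hwloc_const_cpuset_t b)
    {
        hwloc_obj_t node = NULL;
        while ((node = hwloc_get_next_obj_by_type(topo, HWLOC_OBJ_NUMANODE, node)) != NULL)
            if (hwloc_bitmap_intersects(node->cpuset, a) &&
                hwloc_bitmap_intersects(node->cpuset, b))
                return 1;
        return 0;
    }

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        /* Hypothetical example: compare the cpusets of the first two cores. */
        hwloc_obj_t c0 = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0);
        hwloc_obj_t c1 = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 1);
        if (c0 && c1)
            printf("cores 0 and 1 share a NUMA node: %s\n",
                   share_numa(topo, c0->cpuset, c1->cpuset) ? "yes" : "no");

        hwloc_topology_destroy(topo);
        return 0;
    }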

[OMPI users] disabling ucx over omnipath

2019-11-15 Thread Brice Goglin via users
Hello We have a platform with an old MLX4 partition and another OPA partition. We want a single OMPI installation working for both kinds of nodes. When we enable UCX in OMPI for MLX4, UCX ends up being used on the OPA partition too, and the performance is poor (3GB/s instead of 10). The problem se

Re: [OMPI users] new core binding issues?

2018-06-22 Thread Brice Goglin
If psr is the processor where the task is actually running, I guess we'd need your lstopo output to find out where those processors are in the machine. Brice Le 22 juin 2018 19:13:42 GMT+02:00, Noam Bernstein a écrit : >> On Jun 22, 2018, at 1:00 PM, r...@open-mpi.org wrote: >> >> I suspect

Re: [OMPI users] hwloc, OpenMPI and unsupported OSes and toolchains

2018-03-20 Thread Brice Goglin
Hello I am available for off-line discussion for the hwloc side of things. But things look complicated here from your summary below. I guess there's no need for binding on such a system. And topology is quite simple, so it might be easier to hardwire everything. Brice Le 20/03/2018 à 08:36, Ti

Re: [OMPI users] bind-to-core with AMD CMT?

2017-08-29 Thread Brice Goglin
Yes, they share L2 and L1i. Brice Le 30/08/2017 02:16, Gilles Gouaillardet a écrit : > Prentice, > > could you please run > lstopo --of=xml > and post the output ? > > a simple workaround could be to bind each task to two consecutive cores > (assuming two consecutive cores share the same FPU, w

Re: [OMPI users] NUMA interaction with Open MPI

2017-07-20 Thread Brice Goglin
Hello Mems_allowed_list is what your current cgroup/cpuset allows. It is different from what mbind/numactl/hwloc/... change. The former is a root-only restriction that cannot be ignored by processes placed in that cgroup. The latter is a user-changeable binding that must be inside the former. Bri
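For illustration, a minimal C sketch (assuming hwloc >= 2.0) of the user-changeable kind of binding mentioned above; it only takes effect if the requested NUMA node lies inside the enclosing cgroup's Mems_allowed_list.

    /* Sketch: bind this process's future memory allocations to the first
     * NUMA node. This is the user-level binding; the cgroup restriction
     * (Mems_allowed_list) is enforced by the kernel and only root can change it. */
    #include <hwloc.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        hwloc_obj_t node = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, 0);
        if (node) {
            if (hwloc_set_membind(topo, node->nodeset, HWLOC_MEMBIND_BIND,
                                  HWLOC_MEMBIND_BYNODESET) < 0)
                perror("hwloc_set_membind");
            char *buf = malloc(1 << 20);
            memset(buf, 0, 1 << 20);   /* pages are placed on that node when touched */
            free(buf);
        }

        hwloc_topology_destroy(topo);
        return 0;
    }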

Re: [OMPI users] waiting for message either from MPI communicator or from TCP socket

2017-06-25 Thread Brice Goglin
Usually, when you want to listen to two kinds of event, you use poll/select on Unix (or epoll on Linux). This strategy doesn't work for MPI events because MPI doesn't provide a Unix file descriptor to pass to poll/select/epoll. One work-around is to have one thread listen to MPI events. When it rec
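For illustration, a minimal C sketch of the other common workaround, plain polling: alternate a non-blocking MPI_Iprobe with a short select() timeout on the socket ('sockfd' here is a hypothetical, already-connected TCP socket).

    /* Sketch: poll MPI and a TCP socket in one loop, since MPI exposes no
     * file descriptor that select()/poll()/epoll() could wait on directly. */
    #include <mpi.h>
    #include <sys/select.h>
    #include <sys/time.h>

    void event_loop(int sockfd)
    {
        for (;;) {
            int flag = 0;
            MPI_Status st;
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &st);
            if (flag) {
                /* a message is pending: receive it using st.MPI_SOURCE / st.MPI_TAG */
            }

            fd_set rfds;
            FD_ZERO(&rfds);
            FD_SET(sockfd, &rfds);
            struct timeval tv = { 0, 1000 };   /* 1 ms timeout keeps MPI polling responsive */
            if (select(sockfd + 1, &rfds, NULL, NULL, &tv) > 0) {
                /* data is available on the TCP socket: recv() and handle it */
            }
        }
    }

The helper-thread approach described in the message avoids this busy polling but requires initializing MPI with thread support (MPI_Init_thread).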

Re: [OMPI users] "No objects of the specified type were found on at least one node"

2017-03-09 Thread Brice Goglin
SUSE Linux Enterprise Server 10 (ppc) > Release: 10 > > > On 9 March 2017 at 15:04, Brice Goglin <brice.gog...@inria.fr> wrote: > > What's this machine made of? (processor, etc) > What kernel are you running? > > Gettin

Re: [OMPI users] "No objects of the specified type were found on at least one node"

2017-03-09 Thread Brice Goglin
What's this machine made of? (processor, etc) What kernel are you running ? Getting no "socket" or "package" at all is quite rare these days. Brice Le 09/03/2017 15:28, Angel de Vicente a écrit : > Hi again, > > thanks for your help. I installed the latest OpenMPI (2.0.2). > > lstopo output:

Re: [OMPI users] Problem building OpenMPI with CUDA 8.0

2016-10-24 Thread Brice Goglin
FWIW, I am still open to implementing something to workaround this in hwloc. Could be shell variable such as HWLOC_DISABLE_NVML=yes for all our major configured dependencies. Brice Le 24/10/2016 02:12, Gilles Gouaillardet a écrit : > Justin, > > > iirc, NVML is only used by hwloc (e.g. not by C

Re: [OMPI users] Compilation without NVML support

2016-09-20 Thread Brice Goglin
Hello Assuming this NVML detection is actually done by hwloc, I guess there's nothing in OMPI to disable it. It's not the first time we get such an issue with OMPI not having all hwloc's --disable-foo options, but I don't think we actually want to propagate all of them. Maybe we should just force s

Re: [OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?

2016-06-27 Thread Brice Goglin
mlock() and mlockall() only guarantee that pages won't be swapped out to the disk. However, they don't prevent virtual pages from moving to other physical pages (for instance during migration between NUMA nodes), which breaks memory registration. At least this was true a couple years ago, I didn't

Re: [OMPI users] MX replacement?

2016-02-02 Thread Brice Goglin
Le 02/02/2016 15:21, Jeff Squyres (jsquyres) a écrit : > On Feb 2, 2016, at 9:00 AM, Dave Love wrote: >> Now that MX support has been dropped, is there an alternative for fast >> Ethernet? > There are several options for low latency ethernet, but they're all > vendor-based solutions (e.g., my co

Re: [OMPI users] Unable to compile for libnumactl and libnumactl-devel

2015-10-29 Thread Brice Goglin
Le 29/10/2015 21:04, Fabian Wein a écrit : >> If you're compiling Open MPI from source, you need the -devel package >> so that the libnuma header files are installed (and therefore Open >> MPI [i.e., the hwloc embedded in Open MPI] can include those header >> files and then compile support for li

Re: [OMPI users] Using POSIX shared memory as send buffer

2015-10-02 Thread Brice Goglin
Le 28/09/2015 21:44, Dave Goodell (dgoodell) a écrit : > It may have to do with NUMA effects and the way you're allocating/touching > your shared memory vs. your private (malloced) memory. If you have a > multi-NUMA-domain system (i.e., any 2+ socket server, and even some > single-socket serv
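For illustration, a minimal C sketch (assuming hwloc >= 2.0) of the first-touch effect referred to above: pages are physically placed when first written, and hwloc can report which NUMA node they ended up on.

    /* Sketch: allocate a buffer, touch it, then ask hwloc where its pages
     * physically live. With the default first-touch policy they end up on
     * the NUMA node of the CPU that performed the memset(). */
    #include <hwloc.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        size_t len = 4UL << 20;
        char *buf = malloc(len);
        memset(buf, 0, len);           /* first touch happens here */

        hwloc_bitmap_t nodes = hwloc_bitmap_alloc();
        if (hwloc_get_area_memlocation(topo, buf, len, nodes,
                                       HWLOC_MEMBIND_BYNODESET) == 0) {
            char *s;
            hwloc_bitmap_asprintf(&s, nodes);
            printf("buffer pages are on NUMA node(s) %s\n", s);
            free(s);
        }

        hwloc_bitmap_free(nodes);
        free(buf);
        hwloc_topology_destroy(topo);
        return 0;
    }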

[OMPI users] EuroMPI 2015 Call for Participation - Early deadline Sept 1st

2015-08-26 Thread Brice Goglin
Hardware Locality (hwloc) Brice Goglin, Inria Bordeaux Sud-Ouest - Insightful Automatic Performance Modeling Alexandru Calotoiu, TU Darmstadt INVITED TALKS --- - A new high performance network for HPC systems: Bull eXascale Interconnect (BXI), Jean-Pierre Panziera

Re: [OMPI users] Anyone successfully running Abaqus with OpenMPI?

2015-06-22 Thread Brice Goglin
Hello Can you send more details about the incompatibility between hwloc old and recent versions? Maybe there's a workaround there. hwloc is supposed to maintain compatibility but we've seen cases where XML export/import doesn't work because the old version exports buggy XMLs that the recent version

Re: [OMPI users] new hwloc error

2015-04-29 Thread Brice Goglin
Le 29/04/2015 22:25, Noam Bernstein a écrit : >> On Apr 29, 2015, at 4:09 PM, Brice Goglin wrote: >> >> Nothing wrong in that XML. I don't see what could be happening besides a >> node rebooting with hyper-threading enabled for random reasons. >> Please run

Re: [OMPI users] new hwloc error

2015-04-29 Thread Brice Goglin
Le 29/04/2015 18:55, Noam Bernstein a écrit : >> On Apr 29, 2015, at 12:47 PM, Brice Goglin wrote: >> >> Thanks. It's indeed normal that OMPI failed to bind to cpuset 0,16 since >> 16 doesn't exist at all. >> Can you run "lstopo foo.xml" on one

Re: [OMPI users] new hwloc error

2015-04-29 Thread Brice Goglin
Le 29/04/2015 14:53, Noam Bernstein a écrit : > They’re dual 8-core processors, so the 16 cores are physical ones. lstopo > output looks identical on nodes where this does happen, and nodes where it > never does. My next step is to see if I can reproduce the behavior at will - > I’m still n

Re: [OMPI users] new hwloc error

2015-04-28 Thread Brice Goglin
Hello, Can you build hwloc and run lstopo on these nodes to check that everything looks similar? You have hyperthreading enabled on all nodes, and you're trying to bind processes to entire cores, right? Does 0,16 correspond to two hyperthreads within a single core on these nodes? (lstopo -p should

[OMPI users] EuroMPI 2015 Call for Papers

2015-03-26 Thread Brice Goglin
cial issue of International Journal of High Performance Computing Applications (IJHPCA). COMMITTEE - General chair: Jack Dongarra, University of Tennessee Program co-chairs: Alexandre Denis, Inria Bordeaux Sud-Ouest Brice Goglin, Inria Bordeaux Sud-Ouest Emmanuel Jeannot, Inria Bordeaux

Re: [OMPI users] 3. Re: Hwloc error with Openmpi 1.8.3 on AMD 64 (Brice Goglin)

2014-12-20 Thread Brice Goglin
Today's Topics: > > > > 1. Re: Deadlock in OpenMPI 1.8.3 and PETSc 3.4.5 > > (Jeff Squyres (jsquyres)) > > 2. Hwloc error with Openmpi 1.8.3 on AMD 64 (Sergio Manzetti) > > 3. Re: Hwloc error with Openmpi 1.8.3 on AMD 64 (Brice Goglin) > > 4. best function to send d

Re: [OMPI users] Hwloc error with Openmpi 1.8.3 on AMD 64

2014-12-19 Thread Brice Goglin
Hello, The rationale is to read the message and do what it says :) Have a look at www.open-mpi.org/projects/hwloc/doc/v1.10.0/a00028.php#faq_os_error Try upgrading your BIOS and kernel. Otherwise install hwloc and send the output (tarball) of hwloc-gather-topology to hwloc-users (not to OMPI

Re: [OMPI users] OpenMPI 1.8.4 and hwloc in Fedora 14 using a beta gcc 5.0 compiler.

2014-12-15 Thread Brice Goglin
er, we can just pull a new hwloc tarball -- that's >> how we've done it in the past (vs. trying to pull individual patches). It's >> also easier to pull a release tarball, because then we can say "hwloc vX.Y.Z >> is in OMPI vA.B.C", rather than have to

Re: [OMPI users] OpenMPI 1.8.4 and hwloc in Fedora 14 using a beta gcc 5.0 compiler.

2014-12-15 Thread Brice Goglin
Le 15/12/2014 10:35, Jorge D'Elia a écrit : > Hi Brice, > > - Mensaje original - >> De: "Brice Goglin" >> CC: "Open MPI Users" >> Enviado: Jueves, 11 de Diciembre 2014 19:46:44 >> Asunto: Re: [OMPI users] OpenMPI 1.8.4 and hwloc

Re: [OMPI users] OpenMPI 1.8.4 and hwloc in Fedora 14 using a beta gcc 5.0 compiler.

2014-12-11 Thread Brice Goglin
This problem was fixed in hwloc upstream recently. https://github.com/open-mpi/hwloc/commit/790aa2e1e62be6b4f37622959de9ce3766ebc57e Brice Le 11/12/2014 23:40, Jorge D'Elia a écrit : > Dear Jeff, > > Our updates of OpenMPI to 1.8.3 (and 1.8.4) were > all OK using Fedora >= 17 and system gcc com

Re: [OMPI users] netloc

2014-12-05 Thread Brice Goglin
This is a netloc question, not a Open MPI question, so please ask on netloc-us...@open-mpi.org instead. You should attach some of your lstopo outputs to ease debugging. Make sure your lstopo reports I/O devices, it's required for netloc. thanks Brice Le 05/12/2014 21:55, Faraj, Daniel A a écri

Re: [OMPI users] knem in Open MPI 1.8.3

2014-10-31 Thread Brice Goglin
Le 31/10/2014 00:24, Gus Correa a écrit : > 2) Any recommendation for the values of the > various vader btl parameters? > [There are 12 of them in OMPI 1.8.3! > That is real challenge to get right.] > > Which values did you use in your benchmarks? > Defaults? > Other? > > In particular, is there an

[OMPI users] engineer position on hwloc+netloc

2014-10-30 Thread Brice Goglin
Hello, There's an R&D engineer position opening in my research team at Inria Bordeaux (France) for developing hwloc and netloc software (both Open MPI subprojects). All details available at http://runtime.bordeaux.inria.fr/goglin/201410-Engineer-hwloc+netloc.en.pdf or French version http://runt

Re: [OMPI users] mxm 3.0 and knem warnings

2014-08-27 Thread Brice Goglin
Hello Brock, Some people complained that giving world-wide access to a device file by default might be bad if we ever find a security leak in the kernel module. So I needed a better default. The rdma group is often used for OFED devices, and OFED and KNEM users are often the same, so it was a good

Re: [OMPI users] Compiling OpenMPI for Intel Xeon Phi/MIC

2014-06-26 Thread Brice Goglin
Here's what I used to build 1.8.1 with Intel 13.5 recently: module load compiler/13.5.192 export PATH=/usr/linux-k1om-4.7/bin/:$PATH ../configure --prefix=/path/to/your/ompi/install \ CC="icc -mmic" CXX="icpc -mmic" \ --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \ AR=x86_64-k1o

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-03-03 Thread Brice Goglin
Le 03/03/2014 23:02, Gus Correa a écrit : > I rebooted the node and ran hwloc-gather-topology again. > This time it didn't throw any errors on the terminal window, > which may be a good sign. > > [root@node14 ~]# hwloc-gather-topology /tmp/`date > +"%Y%m%d%H%M"`.$(uname -n) > Hierarchy gathered in

Re: [OMPI users] Heterogeneous cluster problem - mixing AMD and Intel nodes

2014-03-02 Thread Brice Goglin
What's your mpirun or mpiexec command-line? The error "BTLs attempted: self sm tcp" says that it didn't even try the MX BTL (for Open-MX). Did you use the MX MTL instead? Are you sure that you actually use Open-MX when not mixing AMD and Intel nodes? Brice Le 02/03/2014 08:06, Victor a écrit :

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Brice Goglin
Le 28/02/2014 21:30, Gus Correa a écrit : > Hi Brice > > The (pdf) output of lstopo shows one L1d (16k) for each core, > and one L1i (64k) for each *pair* of cores. > Is this wrong? It's correct. AMD uses this "dual-core compute unit" where L2 and L1i are shared but L1d isn't. > BTW, if there are

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Brice Goglin
d see if there is anything unusual with the > hardware, and perhaps reinstall the OS, as Ralph suggested. > It is awkward that the other node that had the motherboard replaced > passes the hwloc-gather-topology test. > After motherboard replacement I reinstalled the OS on both, > but it d

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Brice Goglin
Le 28/02/2014 02:48, Ralph Castain a écrit : > Remember, hwloc doesn't actually "sense" hardware - it just parses files in > the /proc area. So if something is garbled in those files, hwloc will report > errors. Doesn't mean anything is wrong with the hardware at all. For the record, that's not

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Brice Goglin
Hello Gus, I'll need the tarball generated by gather-topology on node14 to debug this. node15 doesn't have any issue. We've seen issues on AMD machines because of buggy BIOS reporting incompatible Socket and NUMA info. If node14 doesn't have the same BIOS version as other nodes, that could explain

Re: [OMPI users] "bind-to l3chace" with r30643 in ticket #4240 dosen't work

2014-02-12 Thread Brice Goglin
Is there anything we could do in hwloc to improve this? (I don't even know the exact piece of code you are refering to) Brice Le 12/02/2014 02:46, Ralph Castain a écrit : > Okay, I fixed it. Keep getting caught by a very, very unfortunate design flaw > in hwloc that forces you to treat cache's a

Re: [OMPI users] "-bind-to numa" of openmpi-1.7.4rc1 dosen't work for our magny cours based 32 core node

2013-12-20 Thread Brice Goglin
I don't think there's any such difference. Also, all these NUMA architectures are reported the same by hwloc, and therefore used the same in Open MPI. And yes, L3 and NUMA are topologically-identical on AMD Magny-Cours (and most recent AMD and Intel platforms). Brice Le 20/12/2013 11:33, tmish

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-18 Thread Brice Goglin
hwloc-ps (and lstopo --top) are better at showing process binding but they lack a nice pseudographical interface with dynamic refresh. htop uses hwloc internally iirc, so there's hope we'll have everything needed in htop one day ;) Brice Dave Love a écrit : >John Hearns writes: > >> 'Htop' i

Re: [OMPI users] Mixing Linux's CPU-shielding with mpirun's bind-to-core

2013-08-18 Thread Brice Goglin
Le 18/08/2013 14:51, Siddhartha Jana a écrit : > > If all the above works and does not return errors (you should > check that > your application's PID is in /dev/cpuset/socket0/tasks while running), > bind-to-core won't clash with it, at least when using a OMPI that uses > hwloc

Re: [OMPI users] Mixing Linux's CPU-shielding with mpirun's bind-to-core

2013-08-18 Thread Brice Goglin
Le 18/08/2013 05:34, Siddhartha Jana a écrit : > Hi, > > My requirement: > 1. Avoid the OS from scheduling tasks on cores 0-7 allocated to my > process. > 2. Avoid rescheduling of processes to other cores. > > My solution: I use Linux's CPU-shielding. > [ Man page: > http://www.kernel.org/doc/m

Re: [OMPI users] knem/openmpi performance?

2013-07-18 Thread Brice Goglin
Le 18/07/2013 13:23, Dave Love a écrit : > Mark Dixon writes: > >> On Mon, 15 Jul 2013, Elken, Tom wrote: >> ... >>> Hope these anecdotes are relevant to Open MPI users considering knem. >> ... >> >> Brilliantly useful, thanks! It certainly looks like it may be greatly >> significant for some appl

Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Brice Goglin
ce driver level. > > Anyways, as long as the memory performance difference is a the levels > you mentioned then there is no "big" issue. Most likely the device > driver get space from the same numa domain that of the socket the HCA > is attached to. > > Thanks for

Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Brice Goglin
On a dual E5 2650 machine with FDR cards, I see the IMB Pingpong throughput drop from 6000 to 5700MB/s when the memory isn't allocated on the right socket (and latency increases from 0.8 to 1.4us). Of course that's pingpong only, things will be worse on a memory-overloaded machine. But I don't expe

Re: [OMPI users] Core ids not coming properly

2013-02-15 Thread Brice Goglin
Intel MPI binds processes by default, while OMPI doesn't. What's your mpiexec/mpirun command-line? Brice Le 15/02/2013 17:34, Kranthi Kumar a écrit : > Hello Sir > > Here below is the code which I wrote using hwloc for getting the > bindings of the processes. > I tested this code on SDSC Gordon
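For illustration, a minimal C sketch of what such a check can look like: each rank queries its own binding with hwloc_get_cpubind and prints the resulting cpuset (an unbound rank typically reports the whole machine).

    /* Sketch: every MPI rank prints the cpuset it is currently bound to. */
    #include <mpi.h>
    #include <hwloc.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        hwloc_cpuset_t set = hwloc_bitmap_alloc();
        if (hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS) == 0) {
            char *s;
            hwloc_bitmap_asprintf(&s, set);
            printf("rank %d is bound to cpuset %s\n", rank, s);
            free(s);
        }

        hwloc_bitmap_free(set);
        hwloc_topology_destroy(topo);
        MPI_Finalize();
        return 0;
    }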

Re: [OMPI users] how to find the binding of each rank on the local machine

2013-02-10 Thread Brice Goglin
t least. Brice Le 10/02/2013 22:47, Ralph Castain a écrit : > I honestly have no idea what you mean. Are you talking about inside an MPI > application? Do you mean from inside the MPI layer? Inside ORTE? Inside an > ORTE daemon? > > > On Feb 10, 2013, at 1:41 PM, Brice Gogli

Re: [OMPI users] how to find the binding of each rank on the local machine

2013-02-10 Thread Brice Goglin
Affinity_str" for details (assuming you included the OMPI man > pages in your MANPATH), or look at it online at > > http://www.open-mpi.org/doc/v1.6/man3/OMPI_Affinity_str.3.php > > Remember, you have to configure with --enable-mpi-ext in order to enable the > extensions.

Re: [OMPI users] how to find the binding of each rank on the local machine

2013-02-10 Thread Brice Goglin
I've been talking with Kranthi offline; he wants to use locality info inside OMPI. He needs the binding info from *inside* MPI. From ten thousand feet, it looks like communicator->rank[X]->locality_info as a hwloc object or as a hwloc bitmap. Brice Le 10/02/2013 06:07, Ralph Castain a écrit : >

Re: [OMPI users] Error in configuring hwloc(hardware locality) on Linux on System Z

2012-12-13 Thread Brice Goglin
Le 13/12/2012 10:45, Shikha Maheshwari a écrit : > Hi, > > We are trying to build 'hwloc 1.4.2' on Linux on System Z. To build hwloc Hello, If you are really talking about hwloc, you should contact this mailing list: the Hardware locality user list (Open MPI and hwloc are different software, even

Re: [OMPI users] running "openmpi" with "knem"

2012-12-01 Thread Brice Goglin
Le 01/12/2012 12:45, Leta Melkamu a écrit : > Hello there, > > I have some doubts on the use of knem with openmpi, everything works fine. > However, it is a bit not clear on the usage of knem flags while > running my open-mpi program. > Something like --mca btl_sm_knem_dma_min 4860 is enough or I

Re: [OMPI users] mpi_leave_pinned is dangerous

2012-11-08 Thread Brice Goglin
My understanding of the upstreaming failure was more like: * Linus was going to be OK * Some perf (or trace?) guys came late and said "oh your code should be integrated into our more general stuff" but they didn't do it, and basically vetoed anything that didn't do what they said Brice Le 08/11

Re: [OMPI users] How is hwloc used by OpenMPI

2012-11-07 Thread Brice Goglin
Le 07/11/2012 21:26, Jeff Squyres a écrit : > On Nov 7, 2012, at 1:33 PM, Blosch, Edwin L wrote: > >> I see hwloc is a subproject hosted under OpenMPI but, in reading the >> documentation, I was unable to figure out if hwloc is a module within >> OpenMPI, or if some of the code base is borrowed i

Re: [OMPI users] Best way to map MPI processes to sockets?

2012-11-07 Thread Brice Goglin
What processor and kernel is this? (see /proc/cpuinfo, or run "lstopo -v" and look for attributes on the Socket line) Your hwloc output looks like an Intel Xeon Westmere-EX (E7-48xx or E7-88xx). The likwid output is likely wrong (maybe confused by the fact that hardware threads are disabled). Br

Re: [OMPI users] [omx-devel] Open-mx issue with ompi 1.6.1

2012-09-12 Thread Brice Goglin
em is that the MTL component calls ompi_common_mx_initialize() only once in component_init() but it calls finalize() twice: once in component_close() and once in ompi_mtl_mx_finalize(). The attached patch seems to work. Signed-off-by: Brice Goglin Brice diff --git a/ompi/mca/mtl/mx/mtl_mx.c b/o

Re: [OMPI users] [omx-devel] Open-mx issue with ompi 1.6.1

2012-09-12 Thread Brice Goglin
> open-mx-devel to CC when you reply. >> >> Brice >> >> >> Le 07/09/2012 00:10, Brice Goglin a écrit : >>> Hello Doug, >>> >>> Did you use the same Open-MX version when it worked fine? Same kernel >>> too? >>> Any chance

Re: [OMPI users] [omx-devel] Open-mx issue with ompi 1.6.1

2012-09-10 Thread Brice Goglin
I replied a couple days ago (with OMPI users in CC) but got an error last night: Action: failed Status: 5.0.0 (permanent failure) Diagnostic-Code: smtp; 5.4.7 - Delivery expired (message too old) 'timeout' (delivery attempts: 0) I resent the mail this morning, it looks like it wasn't delivered

Re: [OMPI users] wrong core binding by openmpi-1.5.5

2012-04-12 Thread Brice Goglin
:fork binding child > [[43552,1],0] to cpus 000f > > Regards, > Tetsuya Mishima > >> Here's a better patch. Still only compile tested :) >> Brice >> >> >> Le 11/04/2012 10:36, Brice Goglin a écrit : >> >> A quick look at the code seems to conf

Re: [OMPI users] wrong core binding by openmpi-1.5.5

2012-04-11 Thread Brice Goglin
Here's a better patch. Still only compile tested :) Brice Le 11/04/2012 10:36, Brice Goglin a écrit : > A quick look at the code seems to confirm my feeling. get/set_module() > callbacks manipulate arrays of logical indexes, and they do not convert > them back to physical indexes

Re: [OMPI users] wrong core binding by openmpi-1.5.5

2012-04-11 Thread Brice Goglin
A quick look at the code seems to confirm my feeling. get/set_module() callbacks manipulate arrays of logical indexes, and they do not convert them back to physical indexes before binding. Here's a quick patch that may help. Only compile tested... Brice Le 11/04/2012 09:49, Brice Gog
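For illustration, a minimal hwloc sketch of the logical-versus-physical distinction behind that bug: converting a PU's hwloc logical index to its OS (physical) index and back. Binding interfaces such as sched_setaffinity expect OS indexes, not hwloc logical ones.

    /* Sketch: convert between hwloc logical indexes and OS/physical indexes. */
    #include <hwloc.h>
    #include <stdio.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        /* logical -> physical: the PU that hwloc numbers 0 */
        hwloc_obj_t pu = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, 0);
        if (pu)
            printf("logical PU L#0 has OS index P#%u\n", pu->os_index);

        /* physical -> logical: the PU the OS numbers 0 */
        hwloc_obj_t pu0 = hwloc_get_pu_obj_by_os_index(topo, 0);
        if (pu0)
            printf("OS PU P#0 has logical index L#%u\n", pu0->logical_index);

        hwloc_topology_destroy(topo);
        return 0;
    }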

Re: [OMPI users] wrong core binding by openmpi-1.5.5

2012-04-11 Thread Brice Goglin
Le 11/04/2012 09:06, tmish...@jcity.maeda.co.jp a écrit : > Hi, Brice. > > I installed the latest hwloc-1.4.1. > Here is the output of lstopo -p. > > [root@node03 bin]# ./lstopo -p > Machine (126GB) > Socket P#0 (32GB) > NUMANode P#0 (16GB) + L3 (5118KB) > L2 (512KB) + L1 (64KB) + Core

Re: [OMPI users] wrong core binding by openmpi-1.5.5

2012-04-11 Thread Brice Goglin
Can you send the output of lstopo -p ? (you'll have to install hwloc) Brice tmish...@jcity.maeda.co.jp a écrit : Hi, I updated openmpi from version 1.5.4 to 1.5.5. Then, an execution speed of my application becomes quite slower than before, due to wrong core bindings. As far as I checked, it s

Re: [OMPI users] core binding failure on Interlagos (and possibly Magny-Cours)

2012-01-31 Thread Brice Goglin
Le 31/01/2012 19:02, Dave Love a écrit : >> FWIW, the Linux kernel (at least up to 3.2) still reports wrong L2 and >> L1i cache information on AMD Bulldozer. Kernel bug reported at >> https://bugzilla.kernel.org/show_bug.cgi?id=42607 > I assume that isn't relevant for open-mpi, just other things.

Re: [OMPI users] core binding failure on Interlagos (and possibly Magny-Cours)

2012-01-31 Thread Brice Goglin
Le 31/01/2012 14:24, Jeff Squyres a écrit : > On Jan 31, 2012, at 6:18 AM, Dave Love wrote: > >> Core binding is broken on Interlagos with open-mpi 1.5.4. I guess it >> also bites on Magny-Cours, but all our systems are currently busy and I >> can't check. >> >> It does work, at least basically, i

Re: [OMPI users] [hwloc-devel] EXTERNAL: Re: Unresolved reference 'mbind' and 'get_mempolicy'

2011-12-07 Thread Brice Goglin
Le 07/12/2011 23:00, Rayson Ho a écrit : > We are using hwloc-1.2.2 for topology binding in Open Grid > Scheduler/Grid Engine 2011.11, and a user is encountering similar > issues: > > http://gridengine.org/pipermail/users/2011-December/002126.html > > In Open MPI, there is the configure switch "--w

Re: [OMPI users] EXTERNAL: Re: Unresolved reference 'mbind' and 'get_mempolicy'

2011-09-29 Thread Brice Goglin
Le 28/09/2011 23:02, Blosch, Edwin L a écrit : > Jeff, > > I've tried it now adding --without-libnuma. Actually that did NOT fix the > problem, so I can send you the full output from configure if you want, to > understand why this "hwloc" function is trying to use a function which > appears t

Re: [OMPI users] Unresolved reference 'mbind' and 'get_mempolicy'

2011-09-28 Thread Brice Goglin
Le 28/09/2011 17:55, Blosch, Edwin L a écrit : > > I am getting some undefined references in building OpenMPI 1.5.4 and I > would like to know how to work around it. > > > > The errors look like this: > > > > /scratch1/bloscel/builds/release/openmpi-intel/lib/libmpi.a(topology-linux.o): > In fu

Re: [OMPI users] #cpus/socket

2011-09-13 Thread Brice Goglin
Le 13/09/2011 18:59, Peter Kjellström a écrit : > On Tuesday, September 13, 2011 09:07:32 AM nn3003 wrote: >> Hello ! >> >> I am running wrf model on 4x AMD 6172 which is 12 core CPU. I use OpenMPI >> 1.4.3 and libgomp 4.3.4. I have binaries compiled for shared-memory and >> distributed-memory (O

Re: [OMPI users] openmpi 1.5.4 paffinity with Magny-Cours

2011-09-09 Thread Brice Goglin
Le 09/09/2011 21:03, Kaizaad Bilimorya a écrit : > > We seem to have an issue similar to this thread > > "Bug in openmpi 1.5.4 in paffinity" > http://www.open-mpi.org/community/lists/users/2011/09/17151.php > > Using the following version of hwloc (from EPEL repo - we run CentOS 5.6) > > $ hwloc-in

Re: [OMPI users] Bug in openmpi 1.5.4 in paffinity

2011-09-06 Thread Brice Goglin
ently contains hwloc 1.2.0) > > > On Sep 6, 2011, at 1:43 AM, Brice Goglin wrote: > >> Le 05/09/2011 21:29, Brice Goglin a écrit : >>> Dear Ake, >>> Could you try the attached patch? It's not optimized, but it's probably >>> going in

Re: [OMPI users] Bug in openmpi 1.5.4 in paffinity

2011-09-06 Thread Brice Goglin
Le 05/09/2011 21:29, Brice Goglin a écrit : > Dear Ake, > Could you try the attached patch? It's not optimized, but it's probably > going in the right direction. > (and don't forget to remove the above comment-out if you tried it). Actually, now that I've seen y

Re: [OMPI users] Bug in openmpi 1.5.4 in paffinity

2011-09-05 Thread Brice Goglin
Le 04/09/2011 23:30, Brice Goglin a écrit : > Le 04/09/2011 22:35, Ake Sandgren a écrit : >> On Sun, 2011-09-04 at 22:13 +0200, Brice Goglin wrote: >>> Hello, >>> >>> Could you log again on this node (with same cgroups enabled), run >>> hwlo

Re: [OMPI users] Bug in openmpi 1.5.4 in paffinity

2011-09-04 Thread Brice Goglin
Le 04/09/2011 22:35, Ake Sandgren a écrit : > On Sun, 2011-09-04 at 22:13 +0200, Brice Goglin wrote: >> Hello, >> >> Could you log again on this node (with same cgroups enabled), run >> hwloc-gather-topology >> and send the resulting .output and .tar.bz2?

Re: [OMPI users] Bug in openmpi 1.5.4 in paffinity

2011-09-04 Thread Brice Goglin
Hello, Could you log again on this node (with same cgroups enabled), run hwloc-gather-topology and send the resulting .output and .tar.bz2? Send them to the hwloc-devel or open a ticket on https://svn.open-mpi.org/trac/hwloc (or send them to me in private if you don't want to subscribe). th

Re: [OMPI users] Program hangs when using OpenMPI and CUDA

2011-06-05 Thread Brice Goglin
Le 05/06/2011 00:15, Fengguang Song a écrit : > Hi, > > I'm confronting a problem when using OpenMPI 1.5.1 on a GPU cluster. My > program uses MPI to exchange data > between nodes, and uses cudaMemcpyAsync to exchange data between Host and GPU > devices within a node. > When the MPI message size

Re: [OMPI users] anybody tried OMPI with gpudirect?

2011-03-09 Thread Brice Goglin
adjusting the openib BTL flags. > > --mca btl_openib_flags 304 > > Rolf > > -Original Message- > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > Behalf Of Brice Goglin > Sent: Monday, February 28, 2011 11:16 AM > To: us...@open-mpi.o

Re: [OMPI users] anybody tried OMPI with gpudirect?

2011-02-28 Thread Brice Goglin
Le 28/02/2011 19:49, Rolf vandeVaart a écrit : > For the GPU Direct to work with Infiniband, you need to get some updated OFED > bits from your Infiniband vendor. > > In terms of checking the driver updates, you can do a grep on the string > get_driver_pages in the file /proc/kallsyms. If it is

Re: [OMPI users] anybody tried OMPI with gpudirect?

2011-02-28 Thread Brice Goglin
Le 28/02/2011 17:30, Rolf vandeVaart a écrit : > Hi Brice: > Yes, I have tried OMPI 1.5 with gpudirect and it worked for me. You > definitely need the patch or you will see the behavior just as you described, > a hang. One thing you could try is disabling the large message RDMA in OMPI > and se

[OMPI users] anybody tried OMPI with gpudirect?

2011-02-28 Thread Brice Goglin
there. Has anybody ever looked at this? FWIW, we're using OMPI 1.5, OFED 1.5.2, Intel MPI 4.0.0.28 and SLES11 w/ and w/o the gpudirect patch. Thanks Brice Goglin