Re: [OMPI devel] [LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma, basesmuma, ucx_p2p:basesmsocket, basesmuma, p2p

2022-11-07 Thread Ben Menadue via devel
Hi, We see this on our cluster as well — we traced it to Python loading shared library extensions using RTLD_LOCAL. The Python module (mpi4py?) has a dependency on libmpi.so, which in turn has a dependency on libhcoll.so. So the Python module is being loaded with RTLD_LOCAL, anything
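For anyone hitting the same thing, the common workaround for this class of problem is to get libmpi.so into the global symbol namespace before the extension is loaded (in Python this is often done by preloading it via ctypes with RTLD_GLOBAL). A minimal, illustrative C sketch of the dlopen flag in question (not the actual fix used on this cluster):

    /* Illustrative only: promote libmpi.so's symbols to the global
     * namespace before anything loads it with RTLD_LOCAL, so that
     * components dlopen()ed later can resolve against them. */
    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        void *h = dlopen("libmpi.so", RTLD_NOW | RTLD_GLOBAL);
        if (h == NULL) {
            fprintf(stderr, "dlopen: %s\n", dlerror());
            return 1;
        }
        /* ... the RTLD_LOCAL extension (e.g. mpi4py) is loaded afterwards ... */
        return 0;
    }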

Re: [OMPI devel] v5.0 equivalent of --map-by numa

2021-11-11 Thread Ben Menadue via devel
memory). See https://github.com/open-mpi/ompi/issues/8170 and https://github.com/openpmix/prrte/pull/1141 Brice On 11/11/2021 at 05:33, Ben Menadue via devel wrote: > Hi, > > Quick question: what's the equivalent of "--map-by numa" for the new > PRRTE-based r

[OMPI devel] v5.0 equivalent of --map-by numa

2021-11-10 Thread Ben Menadue via devel
Hi, Quick question: what's the equivalent of "--map-by numa" for the new PRRTE-based runtime for v5.0? I can see "package" and "l3cache" in the help, which are close, but don't quite match "numa" for our system. In more detail... We have dual-socket CLX- and SKL-based nodes with sub-NUMA
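For a quick check of the mismatch, here is a small hwloc sketch (assuming hwloc >= 2.0) that prints how many packages, L3 caches, and NUMA nodes a node exposes; with sub-NUMA clustering enabled the NUMA-node count exceeds the package count, which is why neither "package" nor "l3cache" lines up exactly:

    #include <hwloc.h>
    #include <stdio.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        /* With sub-NUMA clustering, each package exposes more than one
         * NUMA node, so mapping by package is coarser than mapping by
         * NUMA domain. */
        int pkgs = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PACKAGE);
        int l3   = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE);
        int numa = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);
        printf("packages=%d l3caches=%d numanodes=%d\n", pkgs, l3, numa);

        hwloc_topology_destroy(topo);
        return 0;
    }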

[OMPI devel] GitHub v4.0.2 tag is broken

2020-04-01 Thread Ben Menadue via devel
Hi, The v4.0.2 tag in GitHub is broken at the moment -- trying to go to it just takes you to the v4.0.2 _branch_, which looks to be a separate, much more recent fork from master: https://github.com/open-mpi/ompi/tree/v4.0.2 Cheers, Ben

Re: [OMPI devel] 3.1.2: Datatype errors and segfault in MPI_Allgatherv

2018-11-01 Thread Ben Menadue
Nov 2018, at 12:09 pm, Ben Menadue <ben.mena...@nci.org.au> wrote: Hi Gilles, On 2 Nov 2018, at 11:03 am, Gilles Gouaillardet <gil...@rist.or.jp> wrote: I noted the stack traces refer to opal_cuda_memcpy(). Is this issue specific to CUDA environments? No, this is just on normal CPU-only no

Re: [OMPI devel] 3.1.2: Datatype errors and segfault in MPI_Allgatherv

2018-11-01 Thread Ben Menadue
Hi Gilles, > On 2 Nov 2018, at 11:03 am, Gilles Gouaillardet wrote: > I noted the stack traces refer to opal_cuda_memcpy(). Is this issue specific to > CUDA environments? No, this is just on normal CPU-only nodes. But memcpy always goes through opal_cuda_memcpy when CUDA support is enabled,

[OMPI devel] 3.1.2: Datatype errors and segfault in MPI_Allgatherv

2018-11-01 Thread Ben Menadue
Hi, One of our users is reporting an issue using MPI_Allgatherv with a large derived datatype — it segfaults inside OpenMPI. Using a debug build of OpenMPI 3.1.2 produces a ton of messages like this before the segfault: [r3816:50921] ../../../../../opal/datatype/opal_datatype_pack.h:53
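For reference, the failing pattern boils down to something like the following C sketch; the contiguous derived type and the sizes are illustrative stand-ins, not the user's actual datatype:

    #include <mpi.h>
    #include <stdlib.h>

    #define BLOCK (1 << 20)   /* doubles per element of the derived type (illustrative) */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int size;
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* A large derived datatype standing in for the user's type. */
        MPI_Datatype big;
        MPI_Type_contiguous(BLOCK, MPI_DOUBLE, &big);
        MPI_Type_commit(&big);

        double *sendbuf = calloc(BLOCK, sizeof(double));
        double *recvbuf = calloc((size_t)size * BLOCK, sizeof(double));
        int *counts = malloc(size * sizeof(int));
        int *displs = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++) {
            counts[i] = 1;   /* one element of "big" per rank */
            displs[i] = i;   /* displacements in units of "big" */
        }

        /* Each rank contributes one element of the derived type. */
        MPI_Allgatherv(sendbuf, 1, big, recvbuf, counts, displs, big,
                       MPI_COMM_WORLD);

        MPI_Type_free(&big);
        free(sendbuf); free(recvbuf); free(counts); free(displs);
        MPI_Finalize();
        return 0;
    }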

Re: [OMPI devel] Removing the oob/ud component

2018-06-19 Thread Ben Menadue
Hi Jeff, What’s the replacement that it should use instead? I’m pretty sure oob/ud is being picked by default on our IB cluster. Or is oob/tcp good enough? Cheers, Ben > On 20 Jun 2018, at 5:20 am, Jeff Squyres (jsquyres) via devel > wrote: > > We talked about this on the webex today, but

Re: [OMPI devel] [OMPI users] 3.x - hang in MPI_Comm_disconnect

2018-05-21 Thread Ben Menadue
(since the problem is different in > the various releases) in the next few days that points to the problems. > > Comm_spawn is okay, FWIW > > Ralph > > >> On May 21, 2018, at 8:00 PM, Ben Menadue <ben.mena...@nci.org.au>

Re: [OMPI devel] [OMPI users] 3.x - hang in MPI_Comm_disconnect

2018-05-21 Thread Ben Menadue
, and pmix_progress_threads). That said, I’m not sure why get_tracker is reporting 32 procs — there’s only 16 running here (i.e. 1 original + 15 spawned). Or should I post this over in the PMIx list instead? Cheers, Ben > On 17 May 2018, at 9:59 am, Ben Menadue <ben.mena...@nci.org.au> wrote
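For reference, a parent-side C sketch of the pattern being discussed (one original process plus 15 spawned ones); the child binary name is a placeholder, not the actual reproducer:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Spawn 15 children from the single original process. */
        MPI_Comm children;
        int errcodes[15];
        MPI_Comm_spawn("./child", MPI_ARGV_NULL, 15, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &children, errcodes);

        /* ... communicate over the intercommunicator ... */

        /* The hang reported in this thread is in the disconnect. */
        MPI_Comm_disconnect(&children);

        MPI_Finalize();
        return 0;
    }

The children would call MPI_Comm_get_parent() and then MPI_Comm_disconnect() on the resulting intercommunicator before finalizing.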

[OMPI devel] Map by socket broken in 3.0.0?

2017-10-02 Thread Ben Menadue
Hi, I'm having trouble using map-by socket on remote nodes. Running on the same node as mpirun works fine (except for that spurious debugging line): $ mpirun -H localhost:16 -map-by ppr:2:socket:PE=4 -display-map /bin/true [raijin7:22248] SETTING BINDING TO CORE Data for JOB [11140,1] offset 0

[OMPI devel] 3.0.0 - extraneous "DONE" when mapping by core

2017-09-18 Thread Ben Menadue
Hi, I’m seeing an extraneous “DONE” message being printed with OpenMPI 3.0.0 when mapping by core: [bjm900@raijin7 pt2pt]$ mpirun -np 2 ./osu_bw > /dev/null [bjm900@raijin7 pt2pt]$ mpirun -map-by core -np 2 ./osu_bw > /dev/null [raijin7:14376] DONE This patch gets rid of the offending line —

Re: [OMPI devel] Binding with --oversubscribe in 2.0.0

2016-08-25 Thread Ben Menadue
elcome to pull down the patch and locally apply it if it would help. Ralph > On Aug 24, 2016, at 5:29 PM, r...@open-mpi.org wrote: > > Hmmm...bet I know why. Let me poke a bit. > >> On Aug 24, 2016, at 5:18 PM, Ben Menadue <ben.mena...@nci.org.au> wrote: >>

Re: [OMPI devel] Binding with --oversubscribe in 2.0.0

2016-08-24 Thread Ben Menadue
could pull the patch in advance if it is holding you up. > > >> On Aug 23, 2016, at 11:46 PM, Ben Menadue <ben.mena...@nci.org.au> wrote: >> >> Hi, >> >> One of our users has noticed that binding is disabled in 2.0.0 when >> --oversubscribe is pa

[OMPI devel] Binding with --oversubscribe in 2.0.0

2016-08-24 Thread Ben Menadue
Hi, One of our users has noticed that binding is disabled in 2.0.0 when --oversubscribe is passed, which is hurting their performance, likely through migrations between sockets. It looks to be because of 294793c (PR#1228). They need to use --oversubscribe as for some reason the developers

Re: [OMPI devel] mpif.h on Intel build when run with OMPI_FC=gfortran

2016-03-03 Thread Ben Menadue
, but that was before my time. Cheers, Ben From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Dave Turner Sent: Friday, 4 March 2016 3:28 PM To: Ben Menadue <ben.mena...@nci.org.au> Cc: Open MPI Developers <de...@open-mpi.org> Subject: Re: [OMPI devel] mpif.h on Intel bu

Re: [OMPI devel] mpif.h on Intel build when run with OMPI_FC=gfortran

2016-03-03 Thread Ben Menadue
Hi Dave, The issue is the way MPI_Sizeof is handled; it's implemented as a series of interfaces that map the MPI_Sizeof call to the right function in the library. I suspect this is needed because that function doesn't take a datatype argument and instead infers this from the argument types

[OMPI devel] XRC Support

2015-07-08 Thread Ben Menadue
Hi, I just finished building 1.8.6 and master on our cluster and noticed that for both, XRC support wasn't enabled because configure didn't detect the IBV_SRQT_XRC declaration: checking whether IBV_SRQT_XRC is declared... (cached) no ... checking if ConnectX XRC support
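For reference, that declaration check amounts to compiling a small test against the installed verbs headers, roughly like the sketch below (an approximation of an AC_CHECK_DECLS-style probe, not the literal configure test):

    /* Compiles only if <infiniband/verbs.h> declares IBV_SRQT_XRC,
     * i.e. the installed libibverbs headers know about XRC SRQs. */
    #include <infiniband/verbs.h>

    int main(void)
    {
        enum ibv_srq_type t = IBV_SRQT_XRC;
        (void) t;
        return 0;
    }

The "(cached)" in the output also means the result came from an autoconf cache, so a stale config.cache can carry an old "no" across rebuilds.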