Re: [OMPI devel] 3.1.2: Datatype errors and segfault in MPI_Allgatherv

2018-11-01 Thread Ben Menadue
Nov 2018, at 12:09 pm, Ben Menadue <ben.mena...@nci.org.au> wrote: Hi Gilles, On 2 Nov 2018, at 11:03 am, Gilles Gouaillardet <gil...@rist.or.jp> wrote: I noted the stack traces refer to opal_cuda_memcpy(). Is this issue specific to CUDA environments? No, this is just on normal CPU-only no

Re: [OMPI devel] 3.1.2: Datatype errors and segfault in MPI_Allgatherv

2018-11-01 Thread Ben Menadue
Hi Gilles, > On 2 Nov 2018, at 11:03 am, Gilles Gouaillardet wrote: > I noted the stack traces refer to opal_cuda_memcpy(). Is this issue specific to > CUDA environments? No, this is just on normal CPU-only nodes. But memcpy always goes through opal_cuda_memcpy when CUDA support is enabled,

[OMPI devel] 3.1.2: Datatype errors and segfault in MPI_Allgatherv

2018-11-01 Thread Ben Menadue
Hi, One of our users is reporting an issue using MPI_Allgatherv with a large derived datatype — it segfaults inside Open MPI. Using a debug build of Open MPI 3.1.2 produces a ton of messages like this before the segfault: [r3816:50921] ../../../../../opal/datatype/opal_datatype_pack.h:53

Re: [OMPI devel] Removing the oob/ud component

2018-06-19 Thread Ben Menadue
Hi Jeff, What’s the replacement that it should use instead? I’m pretty sure oob/ud is being picked by default on our IB cluster. Or is oob/tcp good enough? Cheers, Ben > On 20 Jun 2018, at 5:20 am, Jeff Squyres (jsquyres) via devel > wrote: > > We talked about this on the webex today, but
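A sketch of the switch in question (assumption: oob carries only control-plane traffic, so forcing the TCP component should not affect MPI data-path performance on the IB fabric):

```shell
# Force the TCP out-of-band component instead of oob/ud
# (./my_app and -np 4 are placeholders)
mpirun --mca oob tcp -np 4 ./my_app
```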

Re: [OMPI devel] [OMPI users] 3.x - hang in MPI_Comm_disconnect

2018-05-21 Thread Ben Menadue
(since the problem is different in > the various releases) in the next few days that points to the problems. > > Comm_spawn is okay, FWIW > > Ralph > > >> On May 21, 2018, at 8:00 PM, Ben Menadue <ben.mena...@nci.org.au>

Re: [OMPI devel] [OMPI users] 3.x - hang in MPI_Comm_disconnect

2018-05-21 Thread Ben Menadue
, and pmix_progress_threads). That said, I’m not sure why get_tracker is reporting 32 procs — there’s only 16 running here (i.e. 1 original + 15 spawned). Or should I post this over in the PMIx list instead? Cheers, Ben > On 17 May 2018, at 9:59 am, Ben Menadue <ben.mena...@nci.org.au> wrote

[OMPI devel] Map by socket broken in 3.0.0?

2017-10-02 Thread Ben Menadue
Hi, I'm having trouble using map by socket on remote nodes. Running on the same node as mpirun works fine (except for that spurious debugging line): $ mpirun -H localhost:16 -map-by ppr:2:socket:PE=4 -display-map /bin/true [raijin7:22248] SETTING BINDING TO CORE Data for JOB [11140,1] offset 0

[OMPI devel] 3.0.0 - extraneous "DONE" when mapping by core

2017-09-18 Thread Ben Menadue
Hi, I’m seeing an extraneous “DONE” message being printed with OpenMPI 3.0.0 when mapping by core: [bjm900@raijin7 pt2pt]$ mpirun -np 2 ./osu_bw > /dev/null [bjm900@raijin7 pt2pt]$ mpirun -map-by core -np 2 ./osu_bw > /dev/null [raijin7:14376] DONE This patch gets rid of the offending line —

Re: [OMPI devel] Binding with --oversubscribe in 2.0.0

2016-08-25 Thread Ben Menadue
elcome to pull down the patch and locally apply it if it would help. Ralph > On Aug 24, 2016, at 5:29 PM, r...@open-mpi.org wrote: > > Hmmm...bet I know why. Let me poke a bit. > >> On Aug 24, 2016, at 5:18 PM, Ben Menadue <ben.mena...@nci.org.au> wrote: >>

Re: [OMPI devel] Binding with --oversubscribe in 2.0.0

2016-08-24 Thread Ben Menadue
Adding --map-by core:oversubscribe makes this work, but then doesn't have binding. Cheers, Ben -Original Message- From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Ben Menadue Sent: Thursday, 25 August 2016 9:36 AM To: 'Open MPI Developers' <devel@lists.open-mpi

Re: [OMPI devel] Binding with --oversubscribe in 2.0.0

2016-08-24 Thread Ben Menadue
could pull the patch in advance if it is holding you up. > > >> On Aug 23, 2016, at 11:46 PM, Ben Menadue <ben.mena...@nci.org.au> wrote: >> >> Hi, >> >> One of our users has noticed that binding is disabled in 2.0.0 when >> --oversubscribe is pa

[OMPI devel] Binding with --oversubscribe in 2.0.0

2016-08-24 Thread Ben Menadue
Hi, One of our users has noticed that binding is disabled in 2.0.0 when --oversubscribe is passed, which is hurting their performance, likely through migrations between sockets. It looks to be because of 294793c (PR#1228). They need to use --oversubscribe as for some reason the developers
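One possible invocation combining oversubscription with an explicit binding request (a sketch of a workaround, not a confirmed fix from the thread):

```shell
# Allow oversubscription at the mapper level, then ask for binding explicitly
# (./my_app and -np 32 are placeholders)
mpirun --map-by core:oversubscribe --bind-to core -np 32 ./my_app
```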

[OMPI devel] MCA_SPML_CALL call in compiled objects

2016-07-12 Thread Ben Menadue
Hi, Looks like there's a #include missing from oshmem/shmem/fortran/shmem_put_nb_f.c. It's causing MCA_SPML_CALL to show up as an undefined symbol, even though it's a macro (among other things). The #include is in shmem_get_nb_f.c but not ..._put_... Patch against master (0e433ea): $ git diff

Re: [OMPI devel] mpif.h on Intel build when run with OMPI_FC=gfortran

2016-03-03 Thread Ben Menadue
, but that was before my time. Cheers, Ben From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Dave Turner Sent: Friday, 4 March 2016 3:28 PM To: Ben Menadue <ben.mena...@nci.org.au> Cc: Open MPI Developers <de...@open-mpi.org> Subject: Re: [OMPI devel] mpif.h on Intel bu

Re: [OMPI devel] mpif.h on Intel build when run with OMPI_FC=gfortran

2016-03-03 Thread Ben Menadue
Hi Dave, The issue is the way MPI_Sizeof is handled; it's implemented as a series of interfaces that map the MPI_Sizeof call to the right function in the library. I suspect this is needed because that function doesn't take a datatype argument and instead infers this from the argument types

[OMPI devel] XRC Support

2015-07-08 Thread Ben Menadue
Hi, I just finished building 1.8.6 and master on our cluster and noticed that for both, XRC support wasn't enabled because configure didn't detect the IBV_SRQT_XRC declaration: checking whether IBV_SRQT_XRC is declared... (cached) no ... checking if ConnectX XRC support
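Given that configure reports the result as "(cached) no", one thing to try (an assumption, not a confirmed diagnosis) is clearing the autoconf cache so the IBV_SRQT_XRC probe actually re-runs against the installed verbs headers:

```shell
# Remove stale configure cache files, then re-run configure so the
# IBV_SRQT_XRC declaration check executes again
rm -f config.cache
./configure ...   # same options as the original build
```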