Re: [OMPI users] Hang in MPI_Abort

2016-06-30 Thread Ralph Castain
Rats - and this only happens on arm32? > On Jun 30, 2016, at 1:56 PM, Orion Poplawski wrote: > > On 06/30/2016 02:55 PM, Orion Poplawski wrote: >> valgrind output: >> >> $ valgrind mpiexec -n 6 ./testphdf5 >> ==8518== Memcheck, a memory error detector >> ==8518== Copyright

Re: [OMPI users] Hang in MPI_Abort

2016-06-30 Thread Orion Poplawski
On 06/30/2016 02:55 PM, Orion Poplawski wrote: > valgrind output: > > $ valgrind mpiexec -n 6 ./testphdf5 > ==8518== Memcheck, a memory error detector > ==8518== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. > ==8518== Using Valgrind-3.11.0 and LibVEX; rerun with -h for

Re: [OMPI users] Hang in MPI_Abort

2016-06-30 Thread Orion Poplawski
valgrind output: $ valgrind mpiexec -n 6 ./testphdf5 ==8518== Memcheck, a memory error detector ==8518== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. ==8518== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info ==8518== Command: mpiexec -n 6 ./testphdf5 ==8518==

Re: [OMPI users] Hang in MPI_Abort

2016-06-30 Thread Ralph Castain
So the application procs are all gone, but mpiexec isn’t exiting? I’d suggest running valgrind, given the corruption. > On Jun 30, 2016, at 10:21 AM, Orion Poplawski wrote: > > On 06/30/2016 10:33 AM, Orion Poplawski wrote: >> No, just mpiexec is running. single node.

Re: [OMPI users] Hang in MPI_Abort

2016-06-30 Thread Orion Poplawski
On 06/30/2016 10:33 AM, Orion Poplawski wrote: > No, just mpiexec is running. single node. Only see it when the test is > executed with "make check", not seeing it if I just run mpiexec -n 6 > ./testphdf5 by hand. Hmm, now I'm seeing it running mpiexec by hand. Trying to check it via gdb

Re: [OMPI users] Hang in MPI_Abort

2016-06-30 Thread Orion Poplawski
No, just mpiexec is running. single node. Only see it when the test is executed with "make check", not seeing it if I just run mpiexec -n 6 ./testphdf5 by hand. On 06/30/2016 09:58 AM, Ralph Castain wrote: > Are the procs still alive? Is this on a single node? > >> On Jun 30, 2016, at 8:49 AM,

Re: [OMPI users] The ompi/mca/cool/sm will not be used on multi-nodes?

2016-06-30 Thread Jeff Squyres (jsquyres)
I actually wouldn't advise ml. It *was* being developed as a joint project between ORNL and Mellanox. I think that code eventually grew into what the "hcoll" Mellanox library currently is. As such, ml reflects kind of a middle point before hcoll became hardened into a real product. It has

Re: [OMPI users] Hang in MPI_Abort

2016-06-30 Thread Ralph Castain
Are the procs still alive? Is this on a single node? > On Jun 30, 2016, at 8:49 AM, Orion Poplawski wrote: > > I'm seeing hangs when MPI_Abort is called. This is with openmpi 1.10.3. e.g: > > program output: > > Testing -- big dataset test (bigdset) > Proc 3: ***

Re: [OMPI users] Hang in MPI_Abort

2016-06-30 Thread Orion Poplawski
On 06/30/2016 09:49 AM, Orion Poplawski wrote: > I'm seeing hangs when MPI_Abort is called. This is with openmpi 1.10.3. e.g: I'll also note that I'm seeing this on 32-bit arm, but not i686 or x86_64. -- Orion Poplawski Technical Manager 303-415-9701 x222 NWRA,

[OMPI users] Hang in MPI_Abort

2016-06-30 Thread Orion Poplawski
I'm seeing hangs when MPI_Abort is called. This is with openmpi 1.10.3. e.g: program output: Testing -- big dataset test (bigdset) Proc 3: *** Parallel ERROR *** VRFY (sizeof(MPI_Offset)>4) failed at line 479 in ../../testpar/t_mdset.c aborting MPI processes Testing -- big dataset test

Re: [OMPI users] The ompi/mca/cool/sm will not be used on multi-nodes?

2016-06-30 Thread Saliya Ekanayake
OK, that's good. I'll try that. So, is *ml* something not being developed now? Any documentation on this component? Thank you, Saliya On Thu, Jun 30, 2016 at 11:01 AM, Gilles Gouaillardet < gilles.gouaillar...@gmail.com> wrote: > you might want to give coll/ml a try > mpirun --mca

Re: [OMPI users] The ompi/mca/cool/sm will not be used on multi-nodes?

2016-06-30 Thread Gilles Gouaillardet
you might want to give coll/ml a try mpirun --mca coll_ml_priority 100 ... Cheers, Gilles On Thursday, June 30, 2016, Saliya Ekanayake wrote: > Thank you, Gilles. The reason for digging into intra-node optimizations is > that we've implemented several machine learning

Re: [OMPI users] The ompi/mca/cool/sm will not be used on multi-nodes?

2016-06-30 Thread Saliya Ekanayake
Thank you, Gilles. The reason for digging into intra-node optimizations is that we've implemented several machine learning applications in OpenMPI (Java binding), but found collective communication to be a bottleneck, especially when the number of procs per node is high. I've implemented a shared

Re: [OMPI users] The ompi/mca/cool/sm will not be used on multi-nodes?

2016-06-30 Thread Gilles Gouaillardet
currently, coll/tuned is not topology aware. this is something interesting, and everyone is invited to contribute. coll/ml is topology aware, but it is kind of unmaintained now. send/recv involves two abstraction layers: pml, and then the interconnect transport. typically, pml/ob1 is used, and it

Re: [OMPI users] The ompi/mca/cool/sm will not be used on multi-nodes?

2016-06-30 Thread Saliya Ekanayake
OK, I am beginning to see how it works now. One question I still have is, in the case of a multi-node communicator it seems coll/tuned (or something other than coll/sm) will be the one used, so do they do any optimizations to reduce communication within a node? Also where can I find the p2p send recv

Re: [OMPI users] The ompi/mca/cool/sm will not be used on multi-nodes?

2016-06-30 Thread Gilles Gouaillardet
the Bcast in coll/sm. coll modules have priority (see ompi_info --all). for a given function (e.g. bcast), the module which implements it and has the highest priority is used. note a module can disqualify itself on a given communicator (e.g. coll/sm on an inter-node communicator). by default,

Re: [OMPI users] The ompi/mca/cool/sm will not be used on multi-nodes?

2016-06-30 Thread Saliya Ekanayake
Thank you, Gilles. What is the bcast I should look for? In general, how do I know which module was used for which communication - can I print this info? On Jun 30, 2016 3:19 AM, "Gilles Gouaillardet" wrote: > 1) is correct. coll/sm is disqualified if the communicator is an

Re: [OMPI users] The ompi/mca/cool/sm will not be used on multi-nodes?

2016-06-30 Thread Gilles Gouaillardet
1) is correct. coll/sm is disqualified if the communicator is an inter communicator or the communicator spans several nodes. you can have a look at the source code, and you will note that bcast does not use send/recv. instead, it uses shared memory, so hopefully, it is faster than other

[OMPI users] The ompi/mca/cool/sm will not be used on multi-nodes?

2016-06-30 Thread Saliya Ekanayake
Hi, Looking at the *ompi/mca/coll/sm/coll_sm_module.c* it seems this module will be used only if the calling communicator solely groups processes within a node. I've got two questions here. 1. So is my understanding correct that for something like MPI_COMM_WORLD where world is multiple processes