Re: [OMPI users] device failed to appear .. Connection timed out

2016-12-08 Thread Howard Pritchard
Hi Daniele, I bet this psm2 got installed as part of MPSS 3.7. I see something in the readme for that about MPSS install with OFED support. I think if you want to go the route of using the RHEL Open MPI RPMs, you could use the mca-params.conf file approach to disabling the use of psm2. This file

Re: [OMPI users] device failed to appear .. Connection timed out

2016-12-08 Thread Daniele Tartarini
Hi, many thanks for your reply. I have an S2600IP Intel motherboard. It is a standalone server and I cannot see any Omni-Path device, and so no such modules. opainfo is not available on my system. Am I missing anything? cheers Daniele On 8 December 2016 at 17:55, Cabral, Matias A wrote: > >Anyway, *

Re: [OMPI users] device failed to appear .. Connection timed out

2016-12-08 Thread Daniele Tartarini
Hi Howard, many thanks for your reply: On 8 December 2016 at 17:22, Howard Pritchard wrote: > hello Daniele, > > Could you post the output from ompi_info command? I'm noticing on the > RPMS that came with the rhel7.2 distro on > one of our systems that it was built to support psm2/hfi-1. > > p

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Noam Bernstein
> On Dec 8, 2016, at 6:05 AM, Gilles Gouaillardet > wrote: > > Christof, > > > There is something really odd with this stack trace. > count is zero, and some pointers do not point to valid addresses (!) > > in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that > the sta

Re: [OMPI users] device failed to appear .. Connection timed out

2016-12-08 Thread Cabral, Matias A
>Anyway, /dev/hfi1_0 doesn't exist. Make sure you have the hfi1 module/driver loaded. In addition, please confirm the links are in active state on all the nodes `opainfo` _MAC From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Howard Pritchard Sent: Thursday, December 08, 2016

Re: [OMPI users] device failed to appear .. Connection timed out

2016-12-08 Thread Howard Pritchard
hello Daniele, Could you post the output from ompi_info command? I'm noticing on the RPMS that came with the rhel7.2 distro on one of our systems that it was built to support psm2/hfi-1. Two things, could you try running applications with mpirun --mca pml ob1 (all the rest of your args) and se

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread r...@open-mpi.org
To the best I can determine, mpirun catches SIGTERM just fine and will hit the procs with SIGCONT, followed by SIGTERM and then SIGKILL. It will then wait to see the remote daemons complete after they hit their procs with the same sequence. > On Dec 8, 2016, at 5:18 AM, Christof Koehler > wr
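
The signal sequence described in the message above (SIGCONT, then SIGTERM, then SIGKILL) can be illustrated with a small stand-alone sketch. This is not mpirun's code: the function name `terminate_escalating` and the one-second grace period are hypothetical, chosen only for illustration, and the example assumes a POSIX system.

```python
import signal
import subprocess
import sys


def terminate_escalating(proc, grace=1.0):
    """Send SIGCONT, SIGTERM, then SIGKILL to proc; return the signals sent.

    Hypothetical helper: 'grace' (seconds to wait after SIGTERM and SIGKILL)
    is an assumed value, not something taken from Open MPI.
    """
    sent = []
    for sig in (signal.SIGCONT, signal.SIGTERM, signal.SIGKILL):
        if proc.poll() is not None:
            break  # process already exited, no need to escalate further
        proc.send_signal(sig)
        sent.append(sig)
        if sig == signal.SIGCONT:
            continue  # SIGCONT only resumes a stopped process; do not wait
        try:
            proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            pass  # still alive, fall through to the next signal
    return sent


# Stand-in for an application process: a child that just sleeps.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
sent = terminate_escalating(child)
child.wait()
print([s.name for s in sent], child.returncode)
```

Sending SIGCONT first matters because a stopped process cannot act on SIGTERM until it is resumed; SIGKILL is kept as the last resort since it cannot be caught or ignored.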

Re: [OMPI users] device failed to appear .. Connection timed out

2016-12-08 Thread r...@open-mpi.org
Sounds like something didn’t quite get configured right, or maybe you have a library installed that isn’t quite set up correctly, or... Regardless, we generally advise building from source to avoid such problems. Is there some reason not to just do so? > On Dec 8, 2016, at 6:16 AM, Daniele Tarta

[OMPI users] MPI+OpenMP core binding redux

2016-12-08 Thread Dave Love
I think there was a suggestion that the SC16 material would explain how to get appropriate core binding for MPI+OpenMP (i.e. OMP_NUM_THREADS cores/process), but it doesn't as far as I can see. Could someone please say how you're supposed to do that in recent versions (without relying on bound DRM

Re: [OMPI users] How to yield CPU more when not computing (was curious behavior during wait for broadcast: 100% cpu)

2016-12-08 Thread Dave Love
Jeff Hammond writes: >> >> >> > Note that MPI implementations may be interested in taking advantage of >> > https://software.intel.com/en-us/blogs/2016/10/06/intel- >> xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait. >> >> Is that really useful if it's KNL-specific and MSR-bas

[OMPI users] device failed to appear .. Connection timed out

2016-12-08 Thread Daniele Tartarini
Hi, I've installed the Open MPI package distributed via Yum on Red Hat 7.2: *openmpi-devel.x86_64 1.10.3-3.el7* With any code I try to run (including the mpitests-*), I get the following message with slight variants: * my_machine.171619hfi_wait_for_device: The /dev/hfi1_0 d

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Christof Koehler
Hello again, I am still not sure about breakpoints. But I did a "catch signal" in gdb, gdb's were attached to the two vasp processes and mpirun. When the root rank exits I see in the gdb attaching to it [Thread 0x2b2787df8700 (LWP 2457) exited] [Thread 0x2b277f483180 (LWP 2455) exited] [Inferior

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Christof Koehler
Hello, On Thu, Dec 08, 2016 at 08:05:44PM +0900, Gilles Gouaillardet wrote: > Christof, > > > There is something really odd with this stack trace. > count is zero, and some pointers do not point to valid addresses (!) Yes, I assumed it was interesting :-) Note that the program is compiled with

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Gilles Gouaillardet
Christof, There is something really odd with this stack trace. count is zero, and some pointers do not point to valid addresses (!) in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that the stack has been corrupted inside MPI_Allreduce(), or that you are not using the libr

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Christof Koehler
Hello everybody, I tried it with the nightly and the direct 2.0.2 branch from git which according to the log should contain that patch commit d0b97d7a408b87425ca53523de369da405358ba2 Merge: ac8c019 b9420bb Author: Jeff Squyres Date: Wed Dec 7 18:24:46 2016 -0500 Merge pull request #2528 fr