Re: [OMPI users] OpenMPI build fails on Windows Subsystem for Linux (WSL).

2018-09-19 Thread John Hearns via users
Oleg, I can build the latest master branch of OpenMPI in WSL.
I can give 3.1.2 a try if that is any help to you.

uname -a
Linux Johns-Spectre 4.4.0-17134-Microsoft #285-Microsoft Thu Aug 30
17:31:00 PST 2018 x86_64 x86_64 x86_64 GNU/Linux
apt-get upgrade
apt-get install gfortran
wget https://github.com/open-mpi/ompi/archive/master.zip
unzip master.zip
cd ompi-master

./autogen.pl
./configure --enable-mpi-cxx

make -j 2

configure returns this:

Open MPI configuration:
---
Version: 4.1.0a1
Build MPI C bindings: yes
Build MPI C++ bindings (deprecated): yes
Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
Build MPI Java bindings (experimental): no
Build Open SHMEM support: false (no spml)
Debug build: no
Platform file: (none)
Miscellaneous
---
CUDA support: no
HWLOC support: internal
Libevent support: internal
PMIx support: internal
Transports
---
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no
Intel TrueScale (PSM): no
Mellanox MXM: no
Open UCX: no
OpenFabrics Libfabric: no
OpenFabrics Verbs: no
Portals4: no
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes
Resource Managers
---
Cray Alps: no
Grid Engine: no
LSF: no
Moab: no
Slurm: yes
ssh/rsh: yes
Torque: no
OMPIO File Systems
---
DDN Infinite Memory Engine: no
Generic Unix FS: yes
Lustre: no
PVFS2/OrangeFS: no


On Wed, 19 Sep 2018 at 17:36, John Hearns  wrote:
>
> Oleg, I have a Windows 10 system and could help by testing this also.
> But I have to say - it will be quicker just to install VirtualBox and
> a CentOS VM. Or an Ubuntu VM.
> You can then set up a small test network of VMs using the VirtualBox
> HostOnly network for tests of your MPI code.
> On Wed, 19 Sep 2018 at 16:59, Jeff Squyres (jsquyres) via users
>  wrote:
> >
> > I can't say that we've tried to build on WSL; the fact that it fails is 
> > probably not entirely surprising.  :-(
> >
> > I looked at your logs, and although I see the compile failure, I don't see 
> > any reason *why* it failed.  Here's the relevant fail from the 
> > tar_openmpi_fail file:
> >
> > -
> > 5523 Making all in mca/filem
> > 5524 make[2]: Entering directory 
> > '/mnt/c/Users/ofcra/dev/openmpi-3.1.2/orte/mca/filem'
> > 5525   GENERATE orte_filem.7
> > 5526   CC   base/filem_base_frame.lo
> > 5527   CC   base/filem_base_select.lo
> > 5528   CC   base/filem_base_receive.lo
> > 5529   CC   base/filem_base_fns.lo
> > 5530 base/filem_base_receive.c: In function 
> > ‘filem_base_process_get_remote_path_cmd’:
> > 5531 base/filem_base_receive.c:250:9: warning: ignoring return value of 
> > ‘getcwd’, declared with attribute warn_unused_result [-Wunused-result]
> > 5532  getcwd(cwd, sizeof(cwd));
> > 5533  ^~~~
> > 5534 base/filem_base_receive.c:251:9: warning: ignoring return value of 
> > ‘asprintf’, declared with attribute warn_unused_result [-Wunused-result]
> > 5535  asprintf(&tmp_name, "%s/%s", cwd, filename);
> > 5536  ^~~
> > 5537 Makefile:1892: recipe for target 'base/filem_base_select.lo' failed
> > 5538 make[2]: *** [base/filem_base_select.lo] Error 1
> > 5539 make[2]: *** Waiting for unfinished jobs
> > 5540 make[2]: Leaving directory 
> > '/mnt/c/Users/ofcra/dev/openmpi-3.1.2/orte/mca/filem'
> > 5541 Makefile:2586: recipe for target 'all-recursive' failed
> > 5542 make[1]: *** [all-recursive] Error 1
> > 5543 make[1]: Leaving directory '/mnt/c/Users/ofcra/dev/openmpi-3.1.2/orte'
> > 5544 Makefile:1897: recipe for target 'all-recursive' failed
> > 5545 make: *** [all-recursive] Error 1
> > -
> >
> > I.e., I see "recipe for target 'base/filem_base_select.lo' failed" -- but 
> > there's no error indicating *why* it failed.  There were 2 warnings when 
> > compiling that file -- but no errors.  Those should not have prevented 
> > compilation of that .c file.
> >
> > You then went on to run "make check", but that failed predictably because 
> > "make" had already failed.
> >
> > You might want to run "make V=1" to see if you can get more details about 
> > why orte/mca/filem/base/filem_base_select.c failed to compile properly.
> >
> > It looks like your GitHub clone build failed in exactly the same place.
> >
> > There's something about filem_base_select.c that is failing to compile -- 
> > that's what we need more detail on.
> >
> >
> >
> > > On Sep 18, 2018, at 10:06 AM, Oleg Kmechak  wrote:
> > >
> > > Hello,
> > >
> > > I am a physics student at the University of Warsaw, and new to OpenMPI. 
> > > Currently I am just trying to compile it from source code (tried both the 
> > > GitHub master and the 3.1.2 tarball).
> > > I am using the Windows Subsystem for Linux (WSL), Ubuntu.
> > >
> > > uname -a:
> > > >Linux Canopus 4.4.0-17134-Microsoft #285-Microsoft Thu Aug 30 17:31:00 
> > > >PST 2018 x86_64 x86_64 x86_64 GNU/Linux

Re: [OMPI users] How do I build 3.1.0 (or later) with Mellanox's libraries

2018-09-19 Thread Barrett, Brian via users
Yeah, there’s no good answer here from an “automatically do the right thing” 
point of view.  The reachable:netlink component (which is used for the TCP BTL) 
only works with libnl-3 because libnl-1 is a real pain to deal with if you’re 
trying to parse route behaviors.  It will do the right thing if you’re using 
OpenIB (the other place the libnl-1/libnl-3 thing comes into play) because 
OpenIB runs its configure test before reachable:netlink, but UCX’s tests run 
way later (for reasons that aren’t fixable).

Mellanox should really update everything to use libnl-3 so that there's at least 
hope of getting the right answer (not just in Open MPI, but in general; libnl-1 
is old and not awesome).  In the meantime, I *think* you can work around this 
problem via two paths.  The first, which I know will work, is to remove the libnl-3 
devel package.  That's probably not optimal for obvious reasons.  The second is 
to specify --enable-mca-no-build=reachable-netlink, which will disable the 
component that prefers libnl-3, and then UCX should be happy.
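
For example, the second approach would look roughly like this (just a sketch -- 
the install prefixes and the Mellanox-related flags below are illustrative, 
substitute whatever you normally pass):

    ./configure --prefix=/opt/openmpi-3.1.2 \
        --with-ucx=/opt/ucx \
        --with-hcoll=/opt/mellanox/hcoll \
        --enable-mca-no-build=reachable-netlink
    make -j 8 && make install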

Hope this helps,

Brian

> On Sep 19, 2018, at 9:12 AM, Jeff Squyres (jsquyres) via users 
>  wrote:
> 
> Alan --
> 
> Sorry for the delay.
> 
> I agree with Gilles: Brian's commit had to do with "reachable" plugins in 
> Open MPI -- they do not appear to be the problem here.
> 
> From the config.log you sent, it looks like configure aborted because you 
> requested UCX support (via --with-ucx) but configure wasn't able to find it.  
> And it looks like it didn't find it because of libnl v1 vs. v3 issues, as you 
> stated.
> 
> I think we're going to have to refer you to Mellanox support on this one.  
> The libnl situation is kind of a nightmare: your entire stack must be 
> compiled for either libnl v1 *or* v3.  If you have both libnl v1 *and* v3 
> appear in a process together, the process will crash before main() even 
> executes.  :-(  This is precisely why we have these warnings in Open MPI's 
> configure.
> 
> 
> 
> 
>> On Sep 14, 2018, at 4:35 PM, Alan Wild  wrote:
>> 
>> As requested, I've attached the config.log.  I also included the output from 
>> configure itself.
>> 
>> -Alan
>> 
>> On Fri, Sep 14, 2018, 10:20 AM Alan Wild  wrote:
>> I apologize if this has been discussed before but I've been unable to find 
>> discussion on the topic.
>> 
>> I recently went to build 3.1.2 on our cluster only to have the build 
>> completely fail during configure due to issues with libnl versions.
>> 
>> Specifically, I had requested support for Mellanox's libraries (mxm, 
>> hcoll, sharp, etc.), which was fine for me in 3.0.0 and 3.0.1.  However, it 
>> appears all of those libraries are built with libnl version 1, but the 
>> netlink component now requires libnl version 3 and aborts the build if 
>> it finds anything else in LIBS that uses version 1.
>> 
>> I don't believe Mellanox is providing releases of these libraries linked 
>> against libnl version 3 (I'd love to find out I'm wrong on that), at least not 
>> for CentOS 6.9.
>> 
>> According to GitHub, it appears bwbarret's commit a543e7f (from one year ago 
>> today), which was merged into 3.1.0, is responsible.  However, I'm having a 
>> hard time believing that Open MPI would want to break support for these 
>> libraries, or that there isn't some other kind of workaround.
>> 
>> I'm on a short timeline to deliver this build of Open MPI to my users, but I 
>> know they won't accept a build that doesn't support Mellanox's libraries.
>> 
>> Hoping there's an easy fix here (short of trying to reverse the commit in my 
>> build) that I'm overlooking.
>> 
>> Thanks,
>> 
>> -Alan
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI build fails on Windows Subsystem for Linux (WSL).

2018-09-19 Thread John Hearns via users
Oleg, I have a Windows 10 system and could help by testing this also.
But I have to say - it will be quicker just to install VirtualBox and
a CentOS VM. Or an Ubuntu VM.
You can then set up a small test network of VMs using the VirtualBox
HostOnly network for tests of your MPI code.
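
If you go that route, a rough sketch of the VirtualBox host-only setup (the VM
names are just examples):

    # create a host-only interface on the host; the command prints its name
    # (on a Windows host it is usually "VirtualBox Host-Only Ethernet Adapter")
    VBoxManage hostonlyif create
    # give each VM a second NIC on that host-only network
    VBoxManage modifyvm "centos-node1" --nic2 hostonly \
        --hostonlyadapter2 "VirtualBox Host-Only Ethernet Adapter"
    VBoxManage modifyvm "centos-node2" --nic2 hostonly \
        --hostonlyadapter2 "VirtualBox Host-Only Ethernet Adapter"

Give the VMs static addresses on that network and list them in your MPI hostfile.
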
On Wed, 19 Sep 2018 at 16:59, Jeff Squyres (jsquyres) via users
 wrote:
>
> I can't say that we've tried to build on WSL; the fact that it fails is 
> probably not entirely surprising.  :-(
>
> I looked at your logs, and although I see the compile failure, I don't see 
> any reason *why* it failed.  Here's the relevant fail from the 
> tar_openmpi_fail file:
>
> -
> 5523 Making all in mca/filem
> 5524 make[2]: Entering directory 
> '/mnt/c/Users/ofcra/dev/openmpi-3.1.2/orte/mca/filem'
> 5525   GENERATE orte_filem.7
> 5526   CC   base/filem_base_frame.lo
> 5527   CC   base/filem_base_select.lo
> 5528   CC   base/filem_base_receive.lo
> 5529   CC   base/filem_base_fns.lo
> 5530 base/filem_base_receive.c: In function 
> ‘filem_base_process_get_remote_path_cmd’:
> 5531 base/filem_base_receive.c:250:9: warning: ignoring return value of 
> ‘getcwd’, declared with attribute warn_unused_result [-Wunused-result]
> 5532  getcwd(cwd, sizeof(cwd));
> 5533  ^~~~
> 5534 base/filem_base_receive.c:251:9: warning: ignoring return value of 
> ‘asprintf’, declared with attribute warn_unused_result [-Wunused-result]
> 5535  asprintf(&tmp_name, "%s/%s", cwd, filename);
> 5536  ^~~
> 5537 Makefile:1892: recipe for target 'base/filem_base_select.lo' failed
> 5538 make[2]: *** [base/filem_base_select.lo] Error 1
> 5539 make[2]: *** Waiting for unfinished jobs
> 5540 make[2]: Leaving directory 
> '/mnt/c/Users/ofcra/dev/openmpi-3.1.2/orte/mca/filem'
> 5541 Makefile:2586: recipe for target 'all-recursive' failed
> 5542 make[1]: *** [all-recursive] Error 1
> 5543 make[1]: Leaving directory '/mnt/c/Users/ofcra/dev/openmpi-3.1.2/orte'
> 5544 Makefile:1897: recipe for target 'all-recursive' failed
> 5545 make: *** [all-recursive] Error 1
> -
>
> I.e., I see "recipe for target 'base/filem_base_select.lo' failed" -- but 
> there's no error indicating *why* it failed.  There were 2 warnings when 
> compiling that file -- but no errors.  Those should not have prevented 
> compilation of that .c file.
>
> You then went on to run "make check", but that failed predictably because 
> "make" had already failed.
>
> You might want to run "make V=1" to see if you can get more details about why 
> orte/mca/filem/base/filem_base_select.c failed to compile properly.
>
> It looks like your GitHub clone build failed in exactly the same place.
>
> There's something about filem_base_select.c that is failing to compile -- 
> that's what we need more detail on.
>
>
>
> > On Sep 18, 2018, at 10:06 AM, Oleg Kmechak  wrote:
> >
> > Hello,
> >
> > I am a physics student at the University of Warsaw, and new to OpenMPI. 
> > Currently I am just trying to compile it from source code (tried both the 
> > GitHub master and the 3.1.2 tarball).
> > I am using the Windows Subsystem for Linux (WSL), Ubuntu.
> >
> > uname -a:
> > >Linux Canopus 4.4.0-17134-Microsoft #285-Microsoft Thu Aug 30 17:31:00 PST 
> > >2018 x86_64 x86_64 x86_64 GNU/Linux
> >
> > I have done all the steps suggested in the INSTALL and HACKING files, and 
> > installed the following tools in the proper order: M4 (1.4.18), autoconf (2.69), 
> > automake (1.15.1), libtool (2.4.6), flex (2.6.4).
> >
> > Next I enabled AUTOMAKE_JOBS=4 and ran:
> >
> > ./autogen.pl   # for source code from GitHub
> >
> > Then
> > ./configure --disable-picky --enable-mpi-cxx --without-cma --enable-static
> >
> > I added --without-cma because I got a lot of warnings about the asprintf 
> > function when compiling.
> >
> > and finally:
> > make -j 4 all   # because I have 4 logical processors
> >
> > And it fails in both versions (GitHub and the 3.1.2 tarball).
> > GitHub version error:
> > >../../../../opal/mca/hwloc/hwloc201/hwloc/include/hwloc.h:71:10: fatal 
> > >error: hwloc/bitmap.h: No such file or directory
> >  #include <hwloc/bitmap.h>
> >
> > And tar(3.1.2) version:
> > >libtool:   error: cannot find the library '../../ompi/libmpi.la' or 
> > >unhandled argument '../../ompi/libmpi.la'
> >
> > Please also see the full log in the attachment.
> > Thanks, I hope you can help (I have already spent a lot of time on this :) )
> >
> >
> > PS: if this is a bug or an unimplemented feature (WSL is probably quite a 
> > specific platform), should I raise an issue on the GitHub project?
> >
> >
> > Regards, Oleg Kmechak
> >
> > ___
> > users mailing list
> > users@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
use

Re: [OMPI users] How do I build 3.1.0 (or later) with Mellanox's libraries

2018-09-19 Thread Jeff Squyres (jsquyres) via users
Alan --

Sorry for the delay.

I agree with Gilles: Brian's commit had to do with "reachable" plugins in Open 
MPI -- they do not appear to be the problem here.

From the config.log you sent, it looks like configure aborted because you 
requested UCX support (via --with-ucx) but configure wasn't able to find it.  
And it looks like it didn't find it because of libnl v1 vs. v3 issues, as you 
stated.

I think we're going to have to refer you to Mellanox support on this one.  The 
libnl situation is kind of a nightmare: your entire stack must be compiled for 
either libnl v1 *or* v3.  If you have both libnl v1 *and* v3 appear in a 
process together, the process will crash before main() even executes.  :-(  
This is precisely why we have these warnings in Open MPI's configure.
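
A quick way to see which libnl generation each piece of your stack pulls in is
to run ldd on the relevant libraries and grep for libnl -- for example (the
paths below are just illustrative):

    for lib in /opt/ucx/lib/libucp.so /opt/mellanox/hcoll/lib/libhcoll.so; do
        echo "== $lib"
        ldd "$lib" | grep -i libnl
    done

If you see both libnl.so.1 and libnl-3.so.200 across that set, that's the
conflict.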




> On Sep 14, 2018, at 4:35 PM, Alan Wild  wrote:
> 
> As requested, I've attached the config.log.  I also included the output from 
> configure itself.
> 
> -Alan
> 
> On Fri, Sep 14, 2018, 10:20 AM Alan Wild  wrote:
> I apologize if this has been discussed before but I've been unable to find 
> discussion on the topic.
> 
> I recently went to build 3.1.2 on our cluster only to have the build 
> completely fail during configure due to issues with libnl versions.
> 
> Specifically, I had requested support for Mellanox's libraries (mxm, 
> hcoll, sharp, etc.), which was fine for me in 3.0.0 and 3.0.1.  However, it 
> appears all of those libraries are built with libnl version 1, but the netlink 
> component now requires libnl version 3 and aborts the build if it finds 
> anything else in LIBS that uses version 1.
> 
> I don't believe Mellanox is providing releases of these libraries linked 
> against libnl version 3 (I'd love to find out I'm wrong on that), at least not 
> for CentOS 6.9.
> 
> According to GitHub, it appears bwbarret's commit a543e7f (from one year ago 
> today), which was merged into 3.1.0, is responsible.  However, I'm having a hard 
> time believing that Open MPI would want to break support for these libraries, 
> or that there isn't some other kind of workaround.
> 
> I'm on a short timeline to deliver this build of Open MPI to my users, but I 
> know they won't accept a build that doesn't support Mellanox's libraries.
> 
> Hoping there's an easy fix here (short of trying to reverse the commit in my 
> build) that I'm overlooking.
> 
> Thanks,
> 
> -Alan
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users


-- 
Jeff Squyres
jsquy...@cisco.com

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] OpenMPI build fails on Windows Subsystem for Linux (WSL).

2018-09-19 Thread Jeff Squyres (jsquyres) via users
I can't say that we've tried to build on WSL; the fact that it fails is 
probably not entirely surprising.  :-(

I looked at your logs, and although I see the compile failure, I don't see any 
reason *why* it failed.  Here's the relevant fail from the tar_openmpi_fail 
file:

-
5523 Making all in mca/filem
5524 make[2]: Entering directory 
'/mnt/c/Users/ofcra/dev/openmpi-3.1.2/orte/mca/filem'
5525   GENERATE orte_filem.7
5526   CC   base/filem_base_frame.lo
5527   CC   base/filem_base_select.lo
5528   CC   base/filem_base_receive.lo
5529   CC   base/filem_base_fns.lo
5530 base/filem_base_receive.c: In function 
‘filem_base_process_get_remote_path_cmd’:
5531 base/filem_base_receive.c:250:9: warning: ignoring return value of 
‘getcwd’, declared with attribute warn_unused_result [-Wunused-result]
5532  getcwd(cwd, sizeof(cwd));
5533  ^~~~
5534 base/filem_base_receive.c:251:9: warning: ignoring return value of 
‘asprintf’, declared with attribute warn_unused_result [-Wunused-result]
5535  asprintf(&tmp_name, "%s/%s", cwd, filename);
5536  ^~~
5537 Makefile:1892: recipe for target 'base/filem_base_select.lo' failed
5538 make[2]: *** [base/filem_base_select.lo] Error 1
5539 make[2]: *** Waiting for unfinished jobs
5540 make[2]: Leaving directory 
'/mnt/c/Users/ofcra/dev/openmpi-3.1.2/orte/mca/filem'
5541 Makefile:2586: recipe for target 'all-recursive' failed
5542 make[1]: *** [all-recursive] Error 1
5543 make[1]: Leaving directory '/mnt/c/Users/ofcra/dev/openmpi-3.1.2/orte'
5544 Makefile:1897: recipe for target 'all-recursive' failed
5545 make: *** [all-recursive] Error 1
-

I.e., I see "recipe for target 'base/filem_base_select.lo' failed" -- but 
there's no error indicating *why* it failed.  There were 2 warnings when 
compiling that file -- but no errors.  Those should not have prevented 
compilation of that .c file.

You then went on to run "make check", but that failed predictably because 
"make" had already failed.

You might want to run "make V=1" to see if you can get more details about why 
orte/mca/filem/base/filem_base_select.c failed to compile properly.
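
Something along these lines (a sketch -- adjust the path to wherever your build
tree lives) should show the full compiler command line for just the failing
file:

    cd /mnt/c/Users/ofcra/dev/openmpi-3.1.2/orte/mca/filem
    make V=1 base/filem_base_select.lo 2>&1 | tee filem_verbose.log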

It looks like your GitHub clone build failed in exactly the same place.

There's something about filem_base_select.c that is failing to compile -- 
that's what we need more detail on.



> On Sep 18, 2018, at 10:06 AM, Oleg Kmechak  wrote:
> 
> Hello, 
> 
> I am a physics student at the University of Warsaw, and new to OpenMPI. 
> Currently I am just trying to compile it from source code (tried both the 
> GitHub master and the 3.1.2 tarball).
> I am using the Windows Subsystem for Linux (WSL), Ubuntu.
> 
> uname -a:
> >Linux Canopus 4.4.0-17134-Microsoft #285-Microsoft Thu Aug 30 17:31:00 PST 
> >2018 x86_64 x86_64 x86_64 GNU/Linux
> 
> I have done all the steps suggested in the INSTALL and HACKING files, and 
> installed the following tools in the proper order: M4 (1.4.18), autoconf (2.69), 
> automake (1.15.1), libtool (2.4.6), flex (2.6.4).
> 
> Next I enabled AUTOMAKE_JOBS=4 and ran: 
> 
> ./autogen.pl   # for source code from GitHub
> 
> Then
> ./configure --disable-picky --enable-mpi-cxx --without-cma --enable-static
> 
> I added --without-cma because I got a lot of warnings about the asprintf 
> function when compiling.
> 
> and finally:
> make -j 4 all   # because I have 4 logical processors
> 
> And it fails in both versions (GitHub and the 3.1.2 tarball).
> GitHub version error:
> >../../../../opal/mca/hwloc/hwloc201/hwloc/include/hwloc.h:71:10: fatal 
> >error: hwloc/bitmap.h: No such file or directory
>  #include <hwloc/bitmap.h>
> 
> And tar(3.1.2) version: 
> >libtool:   error: cannot find the library '../../ompi/libmpi.la' or 
> >unhandled argument '../../ompi/libmpi.la'
> 
> Please also see the full log in the attachment.
> Thanks, I hope you can help (I have already spent a lot of time on this :) )
> 
> 
> PS: if this is a bug or an unimplemented feature (WSL is probably quite a 
> specific platform), should I raise an issue on the GitHub project?
> 
> 
> Regards, Oleg Kmechak
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users


-- 
Jeff Squyres
jsquy...@cisco.com

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] [version 2.1.5] invalid memory reference

2018-09-19 Thread Jeff Squyres (jsquyres) via users
Yeah, it's a bit terrible, but we didn't reliably reproduce this problem for 
many months, either.  :-\

As George noted, it's been ported to all the release branches but is not yet in 
an official release.  Until an official release (4.0.0 just had an RC; it will 
be released soon, and 3.0.3 will have an RC in the immediate future), your best 
bet will be to get a nightly tarball from any of the v2.1.x, v3.0.x, v3.1.x, or 
v4.0.x release branches.

My $0.02: if you're just upgrading from Open MPI v1.7, you might as well jump up 
to v4.0.x (i.e., don't bother jumping to an older release).
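
For example, grabbing a v4.0.x nightly snapshot and building it would look
roughly like this (the snapshot filename changes every night, so treat it as a
placeholder):

    # download the current snapshot from the nightly area on the Open MPI web site
    tar xjf openmpi-v4.0.x-SNAPSHOT.tar.bz2
    cd openmpi-v4.0.x-SNAPSHOT
    ./configure --prefix=$HOME/openmpi-v4.0.x
    make -j 4 && make install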



> On Sep 19, 2018, at 9:53 AM, George Bosilca  wrote:
> 
> I can't speculate on why you did not notice the memory issue before, simply 
> because for months we (the developers) didn't notice it and our testing 
> infrastructure didn't catch this bug despite running millions of tests. The 
> root cause of the bug was a memory ordering issue, and these are really 
> tricky to identify.
> 
> According to https://github.com/open-mpi/ompi/issues/5638 the patch was 
> backported to all stable releases starting from 2.1. Until their official 
> release however you would either need to get a nightly snapshot or test your 
> luck with master.
> 
>   George.
> 
> 
> On Wed, Sep 19, 2018 at 3:41 AM Patrick Begou 
>  wrote:
> Hi George
> 
> thanks for your answer. I was previously using OpenMPI 3.1.2 and also had 
> this problem. However, using --enable-debug --enable-mem-debug at 
> configuration time, I was unable to reproduce the failure and it was quite 
> difficult for me to trace the problem. Maybe I have not run enough tests to 
> reach the failure point.
> 
> I fell back to OpenMPI 2.1.5, thinking the problem was in the 3.x version. 
> The problem was still there, but with the debug config I was able to trace the 
> call stack.
> 
> Which OpenMPI 3.x version do you suggest ? A nightly snapshot ? Cloning the 
> git repo ?
> 
> Thanks
> 
> Patrick
> 
> George Bosilca wrote:
>> A few days ago we pushed a fix in master for a strikingly similar issue. 
>> The patch will eventually make it in the 4.0 and 3.1 but not on the 2.x 
>> series. The best path forward will be to migrate to a more recent OMPI 
>> version.
>> 
>> George.
>> 
>> 
>> On Tue, Sep 18, 2018 at 3:50 AM Patrick Begou 
>>  wrote:
>> Hi
>> 
>> I'm moving a large CFD code from Gcc 4.8.5/OpenMPI 1.7.3 to Gcc 
>> 7.3.0/OpenMPI 2.1.5 and with this latest config I have random segfaults.
>> Same binary, same server, same number of processes (16), same parameters for 
>> the run. Sometimes it runs until the end, sometimes I get 'invalid memory 
>> reference'.
>> 
>> Building the application and OpenMPI in debug mode, I saw that this random 
>> segfault always occurs in collective communications inside OpenMPI. I've no 
>> idea how to track this. These are 2 call stack traces (just the OpenMPI part):
>> 
>> Calling  MPI_ALLREDUCE(...)
>> 
>> Program received signal SIGSEGV: Segmentation fault - invalid memory 
>> reference.
>> 
>> Backtrace for this error:
>> #0  0x7f01937022ef in ???
>> #1  0x7f0192dd0331 in mca_btl_vader_check_fboxes
>> at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
>> #2  0x7f0192dd0331 in mca_btl_vader_component_progress
>> at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
>> #3  0x7f0192d6b92b in opal_progress
>> at ../../opal/runtime/opal_progress.c:226
>> #4  0x7f0194a8a9a4 in sync_wait_st
>> at ../../opal/threads/wait_sync.h:80
>> #5  0x7f0194a8a9a4 in ompi_request_default_wait_all
>> at ../../ompi/request/req_wait.c:221
>> #6  0x7f0194af1936 in ompi_coll_base_allreduce_intra_recursivedoubling
>> at ../../../../ompi/mca/coll/base/coll_base_allreduce.c:225
>> #7  0x7f0194aa0a0a in PMPI_Allreduce
>> at 
>> /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pallreduce.c:107
>> #8  0x7f0194f2e2ba in ompi_allreduce_f
>> at 
>> /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pallreduce_f.c:87
>> #9  0x8e21fd in __linear_solver_deflation_m_MOD_solve_el_grp_pcg
>> at linear_solver_deflation_m.f90:341
>> 
>> 
>> Calling MPI_WAITALL()
>> 
>> Program received signal SIGSEGV: Segmentation fault - invalid memory 
>> reference.
>> 
>> Backtrace for this error:
>> #0  0x7fda5a8d72ef in ???
>> #1  0x7fda59fa5331 in mca_btl_vader_check_fboxes
>> at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
>> #2  0x7fda59fa5331 in mca_btl_vader_component_progress
>> at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
>> #3  0x7fda59f4092b in opal_progress
>> at ../../opal/runtime/opal_progress.c:226
>> #4  0x7fda5bc5f9a4 in sync_wait_st
>> at ../../opal/threads/wait_sync.h:80
>> #5  0x7fda5bc5f9a4 in ompi_request_default_wait_all
>> at ../../ompi/request/req_wait.c:221
>> #6  0x7fda5bca329e in PMPI_Waitall
>> at 
>> /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pwaitall.c:76
>> #7  0x7fda5c10bc00 in ompi_waitall_f
>>

Re: [OMPI users] Unable to spawn MPI processes on multiple nodes with recent version of OpenMPI

2018-09-19 Thread Andrew Benson
On further investigation, removing the "preconnect_all" option does at least 
change the problem. Without "preconnect_all" I no longer see:

--
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[32179,2],15]) is on host: node092
  Process 2 ([[32179,2],0]) is on host: unknown!
  BTLs attempted: self tcp vader

Your MPI job is now going to abort; sorry.
--


Instead it hangs for several minutes and finally aborts with:

--
A request has timed out and will therefore fail:

  Operation:  LOOKUP: orted/pmix/pmix_server_pub.c:345

Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.
--
[node091:19470] *** An error occurred in MPI_Comm_spawn
[node091:19470] *** reported by process [1614086145,0]
[node091:19470] *** on communicator MPI_COMM_WORLD
[node091:19470] *** MPI_ERR_UNKNOWN: unknown error
[node091:19470] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will 
now abort,
[node091:19470] ***and potentially your MPI job)

I've tried increasing both pmix_server_max_wait and pmix_base_exchange_timeout 
as suggested in the error message, but the result is unchanged (it just takes 
longer to time out).
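
For reference, this is roughly how I was passing them (the values are just what
I tried, not recommendations):

    mpirun --map-by node \
        --mca pmix_server_max_wait 600 \
        --mca pmix_base_exchange_timeout 600 \
        ./test.exe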

Once again, if I remove "--map-by node" it runs successfully.

-Andrew



On Sunday, September 16, 2018 7:03:15 AM PDT Ralph H Castain wrote:
> I see you are using “preconnect_all” - that is the source of the trouble. I
> don’t believe we have tested that option in years and the code is almost
> certainly dead. I’d suggest removing that option and things should work.
> > On Sep 15, 2018, at 1:46 PM, Andrew Benson  wrote:
> > 
> > I'm running into problems trying to spawn MPI processes across multiple
> > nodes on a cluster using recent versions of OpenMPI. Specifically, using
> > the attached Fortran code, compiled using OpenMPI 3.1.2 with:
> > 
> > mpif90 test.F90 -o test.exe
> > 
> > and run via a PBS scheduler using the attached test1.pbs, it fails as can
> > be seen in the attached testFAIL.err file.
> > 
> > If I do the same but using OpenMPI v1.10.3 then it works successfully,
> > giving me the output in the attached testSUCCESS.err file.
> > 
> > From testing a few different versions of OpenMPI it seems that the
> > behavior
> > changed between v1.10.7 and v2.0.4.
> > 
> > Is there some change in options needed to make this work with newer
> > OpenMPIs?
> > 
> > Output from ompi_info --all is attached. config.log can be found here:
> > 
> > http://users.obs.carnegiescience.edu/abenson/config.log.bz2
> > 
> > Thanks for any help you can offer!
> > 
> > -Andrew
> > ___ users mailing list
> > users@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/users
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users


-- 

* Andrew Benson: http://users.obs.carnegiescience.edu/abenson/contact.html

* Galacticus: https://bitbucket.org/abensonca/galacticus

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] [version 2.1.5] invalid memory reference

2018-09-19 Thread George Bosilca
I can't speculate on why you did not notice the memory issue before, simply
because for months we (the developers) didn't notice it and our testing
infrastructure didn't catch this bug despite running millions of tests. The
root cause of the bug was a memory ordering issue, and these are really
tricky to identify.

According to https://github.com/open-mpi/ompi/issues/5638 the patch was
backported to all stable releases starting from 2.1. Until their official
release however you would either need to get a nightly snapshot or test
your luck with master.
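
If you decide to try master, a minimal sketch would be:

    git clone https://github.com/open-mpi/ompi.git
    cd ompi
    ./autogen.pl
    ./configure --prefix=$HOME/ompi-master --enable-debug
    make -j 4 && make install

(the nightly snapshot tarballs ship with configure already generated, so they
skip the autogen.pl step).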

  George.


On Wed, Sep 19, 2018 at 3:41 AM Patrick Begou <
patrick.be...@legi.grenoble-inp.fr> wrote:

> Hi George
>
> thanks for your answer. I was previously using OpenMPI 3.1.2 and also had
> this problem. However, using --enable-debug --enable-mem-debug at
> configuration time, I was unable to reproduce the failure and it was quite
> difficult for me to trace the problem. Maybe I have not run enough tests
> to reach the failure point.
>
> I fell back to OpenMPI 2.1.5, thinking the problem was in the 3.x
> version. The problem was still there, but with the debug config I was able
> to trace the call stack.
>
> Which OpenMPI 3.x version do you suggest ? A nightly snapshot ? Cloning
> the git repo ?
>
> Thanks
>
> Patrick
>
> George Bosilca wrote:
>
> A few days ago we pushed a fix in master for a strikingly similar
> issue. The patch will eventually make it in the 4.0 and 3.1 but not on the
> 2.x series. The best path forward will be to migrate to a more recent OMPI
> version.
>
> George.
>
>
> On Tue, Sep 18, 2018 at 3:50 AM Patrick Begou <
> patrick.be...@legi.grenoble-inp.fr> wrote:
>
>> Hi
>>
>> I'm moving a large CFD code from Gcc 4.8.5/OpenMPI 1.7.3 to Gcc
>> 7.3.0/OpenMPI 2.1.5 and with this latest config I have random segfaults.
>> Same binary, same server, same number of processes (16), same parameters
>> for the run. Sometimes it runs until the end, sometimes I get 'invalid
>> memory reference'.
>>
>> Building the application and OpenMPI in debug mode, I saw that this random
>> segfault always occurs in collective communications inside OpenMPI. I've no
>> idea how to track this. These are 2 call stack traces (just the OpenMPI
>> part):
>>
>> Calling MPI_ALLREDUCE(...)
>>
>> Program received signal SIGSEGV: Segmentation fault - invalid memory
>> reference.
>>
>> Backtrace for this error:
>> #0  0x7f01937022ef in ???
>> #1  0x7f0192dd0331 in mca_btl_vader_check_fboxes
>> at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
>> #2  0x7f0192dd0331 in mca_btl_vader_component_progress
>> at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
>> #3  0x7f0192d6b92b in opal_progress
>> at ../../opal/runtime/opal_progress.c:226
>> #4  0x7f0194a8a9a4 in sync_wait_st
>> at ../../opal/threads/wait_sync.h:80
>> #5  0x7f0194a8a9a4 in ompi_request_default_wait_all
>> at ../../ompi/request/req_wait.c:221
>> #6  0x7f0194af1936 in ompi_coll_base_allreduce_intra_recursivedoubling
>> at ../../../../ompi/mca/coll/base/coll_base_allreduce.c:225
>> #7  0x7f0194aa0a0a in PMPI_Allreduce
>> at
>> /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pallreduce.c:107
>> #8  0x7f0194f2e2ba in ompi_allreduce_f
>> at
>> /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pallreduce_f.c:87
>> #9  0x8e21fd in __linear_solver_deflation_m_MOD_solve_el_grp_pcg
>> at linear_solver_deflation_m.f90:341
>>
>>
>> Calling MPI_WAITALL()
>>
>> Program received signal SIGSEGV: Segmentation fault - invalid memory
>> reference.
>>
>> Backtrace for this error:
>> #0  0x7fda5a8d72ef in ???
>> #1  0x7fda59fa5331 in mca_btl_vader_check_fboxes
>> at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
>> #2  0x7fda59fa5331 in mca_btl_vader_component_progress
>> at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
>> #3  0x7fda59f4092b in opal_progress
>> at ../../opal/runtime/opal_progress.c:226
>> #4  0x7fda5bc5f9a4 in sync_wait_st
>> at ../../opal/threads/wait_sync.h:80
>> #5  0x7fda5bc5f9a4 in ompi_request_default_wait_all
>> at ../../ompi/request/req_wait.c:221
>> #6  0x7fda5bca329e in PMPI_Waitall
>> at
>> /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pwaitall.c:76
>> #7  0x7fda5c10bc00 in ompi_waitall_f
>> at
>> /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pwaitall_f.c:104
>> #8  0x6dcbf7 in __data_comm_m_MOD_update_ghost_ext_comm_r1
>> at data_comm_m.f90:5849
>>
>>
>> The segfault is always located in opal/mca/btl/vader/btl_vader_fbox.h at
>> 207/* call the registered callback function */
>> 208   reg->cbfunc(&mca_btl_vader.super, hdr.data.tag, &desc,
>> reg->cbdata);
>>
>>
>> OpenMPI 2.1.5 is build with:
>> CFLAGS="-O3 -march=native -mtune=native" CXXFLAGS="-O3 -march=native
>> -mtune=native" FCFLAGS="-O3 -march=native -mtune=native" \
>> ../configure --prefix=$DESTMPI  --enable

Re: [OMPI users] [version 2.1.5] invalid memory reference

2018-09-19 Thread Patrick Begou

Hi George

thanks for your answer. I was previously using OpenMPI 3.1.2 and also had this 
problem. However, using --enable-debug --enable-mem-debug at configuration time, 
I was unable to reproduce the failure and it was quite difficult for me to trace 
the problem. Maybe I have not run enough tests to reach the failure point.


I fell back to OpenMPI 2.1.5, thinking the problem was in the 3.x version. The 
problem was still there, but with the debug config I was able to trace the call 
stack.


Which OpenMPI 3.x version do you suggest ? A nightly snapshot ? Cloning the git 
repo ?


Thanks

Patrick

George Bosilca wrote:
A few days ago we pushed a fix in master for a strikingly similar issue. 
The patch will eventually make it in the 4.0 and 3.1 but not on the 2.x 
series. The best path forward will be to migrate to a more recent OMPI version.


George.


On Tue, Sep 18, 2018 at 3:50 AM Patrick Begou 
> wrote:


Hi

I'm moving a large CFD code from Gcc 4.8.5/OpenMPI 1.7.3 to Gcc
7.3.0/OpenMPI 2.1.5 and with this latest config I have random segfaults.
Same binary, same server, same number of processes (16), same parameters
for the run. Sometimes it runs until the end, sometimes I get 'invalid
memory reference'.

Building the application and OpenMPI in debug mode, I saw that this random
segfault always occurs in collective communications inside OpenMPI. I've no
idea how to track this. These are 2 call stack traces (just the OpenMPI
part):

Calling MPI_ALLREDUCE(...)
Program received signal SIGSEGV: Segmentation fault - invalid memory
reference.

Backtrace for this error:
#0  0x7f01937022ef in ???
#1  0x7f0192dd0331 in mca_btl_vader_check_fboxes
    at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
#2  0x7f0192dd0331 in mca_btl_vader_component_progress
    at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
#3  0x7f0192d6b92b in opal_progress
    at ../../opal/runtime/opal_progress.c:226
#4  0x7f0194a8a9a4 in sync_wait_st
    at ../../opal/threads/wait_sync.h:80
#5  0x7f0194a8a9a4 in ompi_request_default_wait_all
    at ../../ompi/request/req_wait.c:221
#6  0x7f0194af1936 in ompi_coll_base_allreduce_intra_recursivedoubling
    at ../../../../ompi/mca/coll/base/coll_base_allreduce.c:225
#7  0x7f0194aa0a0a in PMPI_Allreduce
    at

/kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pallreduce.c:107
#8  0x7f0194f2e2ba in ompi_allreduce_f
    at

/kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pallreduce_f.c:87
#9  0x8e21fd in __linear_solver_deflation_m_MOD_solve_el_grp_pcg
    at linear_solver_deflation_m.f90:341


Calling MPI_WAITALL()

Program received signal SIGSEGV: Segmentation fault - invalid memory
reference.

Backtrace for this error:
#0  0x7fda5a8d72ef in ???
#1  0x7fda59fa5331 in mca_btl_vader_check_fboxes
    at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
#2  0x7fda59fa5331 in mca_btl_vader_component_progress
    at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
#3  0x7fda59f4092b in opal_progress
    at ../../opal/runtime/opal_progress.c:226
#4  0x7fda5bc5f9a4 in sync_wait_st
    at ../../opal/threads/wait_sync.h:80
#5  0x7fda5bc5f9a4 in ompi_request_default_wait_all
    at ../../ompi/request/req_wait.c:221
#6  0x7fda5bca329e in PMPI_Waitall
    at

/kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pwaitall.c:76
#7  0x7fda5c10bc00 in ompi_waitall_f
    at

/kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pwaitall_f.c:104
#8  0x6dcbf7 in __data_comm_m_MOD_update_ghost_ext_comm_r1
    at data_comm_m.f90:5849


The segfault is always located in opal/mca/btl/vader/btl_vader_fbox.h at
207    /* call the registered callback function */
208 reg->cbfunc(&mca_btl_vader.super, hdr.data.tag, &desc, reg->cbdata);


OpenMPI 2.1.5 is build with:
CFLAGS="-O3 -march=native -mtune=native" CXXFLAGS="-O3 -march=native
-mtune=native" FCFLAGS="-O3 -march=native -mtune=native" \
../configure --prefix=$DESTMPI --enable-mpirun-prefix-by-default
--disable-dlopen \
--enable-mca-no-build=openib --without-verbs --enable-mpi-cxx
--without-slurm --enable-mpi-thread-multiple  --enable-debug
--enable-mem-debug

Any help appreciated

Patrick

-- 
===

|  Equipe M.O.S.T. |  |
|  Patrick BEGOU   |mailto:patrick.be...@grenoble-inp.fr  |
|  LEGI|  |
|  BP 53 X | Tel 04 76 82 51 35   |
|  38041 GRENOBLE CEDEX| Fax 04 76 82 52 71