Re: [OMPI users] Setting LD_LIBRARY_PATH for orted

2017-08-22 Thread Jackson, Gary L.
Yup. It looks like I’m stuck with .bashrc.

Thank you all for the suggestions.

-- 
Gary Jackson, Ph.D.
Johns Hopkins University Applied Physics Laboratory

On 8/22/17, 1:07 PM, "users on behalf of r...@open-mpi.org" 
<users-boun...@lists.open-mpi.org on behalf of r...@open-mpi.org> wrote:

I’m afraid not - that only applies the variable to the application, not the 
daemons.

Truly, your only real option is to put something in your .bashrc, since you 
cannot modify the configure options.

Or, if you are running in a managed environment, you can ask to have your 
resource manager forward your environment to the allocated nodes.
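
For example, a sketch assuming a Slurm-managed cluster, where mpirun launches its 
daemons via srun and srun forwards the submission environment (the library path and 
application name below are placeholders, not anything from this thread):

#!/bin/bash
#SBATCH -N 2
#SBATCH --export=ALL        # propagate the submission environment to the job
export LD_LIBRARY_PATH=/opt/vendor/lib:$LD_LIBRARY_PATH
mpirun -n 2 ./my_app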

> On Aug 22, 2017, at 9:10 AM, Bennet Fauber <ben...@umich.edu> wrote:
> 
> Would
> 
>$ mpirun -x LD_LIBRARY_PATH ...
> 
> work here?  I think from the man page for mpirun that should request
> that it would export the currently set value of LD_LIBRARY_PATH
> to the remote nodes prior to executing the command there.
> 
> -- bennet
> 
>
>
> On Tue, Aug 22, 2017 at 11:55 AM, Jackson, Gary L.
> <gary.jack...@jhuapl.edu> wrote:
>> I’m using a build of OpenMPI provided by a third party.
>> 
>> --
>> Gary Jackson, Ph.D.
>> Johns Hopkins University Applied Physics Laboratory
>> 
>> On 8/21/17, 8:04 PM, "users on behalf of Gilles Gouaillardet" 
<users-boun...@lists.open-mpi.org on behalf of gil...@rist.or.jp> wrote:
>> 
>>Gary,
>> 
>> 
>>one option (as mentioned in the error message) is to configure Open MPI
>>with --enable-orterun-prefix-by-default.
>> 
>>this will force the build process to use rpath, so you do not have to
>>set LD_LIBRARY_PATH
>> 
>>this is the easiest option, but cannot be used if you plan to relocate
>>the Open MPI installation directory.
>> 
>> 
>>another option is to use a wrapper for orted.
>> 
>>mpirun --mca orte_launch_agent /.../myorted ...
>> 
>>where myorted is a script that looks like
>> 
>>#!/bin/sh
>> 
>>export LD_LIBRARY_PATH=...
>> 
>>exec /.../bin/orted "$@"
>> 
>> 
>>you can make this setting system-wide by adding the following line to
>>/.../etc/openmpi-mca-params.conf
>> 
>>orte_launch_agent = /.../myorted
>> 
>> 
>>Cheers,
>> 
>> 
>>Gilles
>> 
>> 
>>On 8/22/2017 1:06 AM, Jackson, Gary L. wrote:
>>> 
>>> I’m using a binary distribution of OpenMPI 1.10.2. As linked, it
>>> requires certain shared libraries outside of OpenMPI for orted itself
>>> to start. So, passing in LD_LIBRARY_PATH with the “-x” flag to mpirun
>>> doesn’t do anything:
>>> 
>>> $ mpirun --hostfile ${HOSTFILE} -N 1 -n 2 -x LD_LIBRARY_PATH hostname
>>> 
>>> /path/to/orted: error while loading shared libraries: LIBRARY.so:
>>> cannot open shared object file: No such file or directory
>>> 
>>>
>>> --
>>> ORTE was unable to reliably start one or more daemons.
>>> This usually is caused by:
>>>
>>> * not finding the required libraries and/or binaries on
>>>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>>>
>>> * lack of authority to execute on one or more specified nodes.
>>>   Please verify your allocation and authorities.
>>>
>>> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>>>   Please check with your sys admin to determine the correct location to use.
>>>
>>> * compilation of the orted with dynamic libraries when static are required
>>>   (e.g., on Cray). Please check your configure cmd line and consider using
>>>   one of the contrib/platform definitions for your system type.
>>>
>>> * an inability to create a connection back to mpirun due to a
>>>   lack of common network interfaces and/or no route found between
>>>   them. Please check network connectivity (including firewalls
>>>   and network routing requirements).
>>> --

Re: [OMPI users] Setting LD_LIBRARY_PATH for orted

2017-08-22 Thread Jackson, Gary L.
I’m using a build of OpenMPI provided by a third party.

-- 
Gary Jackson, Ph.D.
Johns Hopkins University Applied Physics Laboratory

On 8/21/17, 8:04 PM, "users on behalf of Gilles Gouaillardet" 
<users-boun...@lists.open-mpi.org on behalf of gil...@rist.or.jp> wrote:

Gary,


one option (as mentioned in the error message) is to configure Open MPI 
with --enable-orterun-prefix-by-default.

this will force the build process to use rpath, so you do not have to 
set LD_LIBRARY_PATH

this is the easiest option, but cannot be used if you plan to relocate 
the Open MPI installation directory.
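
For anyone who can rebuild from source, a minimal sketch of that first option (the 
install prefix is just an illustration):

./configure --prefix=/opt/openmpi-1.10.2 --enable-orterun-prefix-by-default
make -j 8 all
make install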


another option is to use a wrapper for orted.

mpirun --mca orte_launch_agent /.../myorted ...

where myorted is a script that looks like

#!/bin/sh

export LD_LIBRARY_PATH=...

exec /.../bin/orted "$@"


you can make this setting system-wide by adding the following line to 
/.../etc/openmpi-mca-params.conf

orte_launch_agent = /.../myorted
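
Putting the pieces together, a sketch with placeholder paths (the wrapper must be 
reachable at the same path on every node, e.g. on a shared filesystem):

cat > /shared/bin/myorted <<'EOF'
#!/bin/sh
# prepend the libraries the third-party build needs, then hand off to the real orted
export LD_LIBRARY_PATH=/opt/vendor/lib:$LD_LIBRARY_PATH
exec /opt/openmpi-1.10.2/bin/orted "$@"
EOF
chmod +x /shared/bin/myorted

mpirun --mca orte_launch_agent /shared/bin/myorted -hostfile hosts -n 2 ./my_app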


Cheers,


Gilles


On 8/22/2017 1:06 AM, Jackson, Gary L. wrote:
>
> I’m using a binary distribution of OpenMPI 1.10.2. As linked, it 
> requires certain shared libraries outside of OpenMPI for orted itself 
> to start. So, passing in LD_LIBRARY_PATH with the “-x” flag to mpirun 
> doesn’t do anything:
>
> $ mpirun --hostfile ${HOSTFILE} -N 1 -n 2 -x LD_LIBRARY_PATH hostname
>
> /path/to/orted: error while loading shared libraries: LIBRARY.so: 
> cannot open shared object file: No such file or directory
>
> --
>
> ORTE was unable to reliably start one or more daemons.
>
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>
> one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>
> settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>
> Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp 
> (--tmpdir/orte_tmpdir_base).
>
> Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>
> (e.g., on Cray). Please check your configure cmd line and consider using
>
> one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>
> lack of common network interfaces and/or no route found between
>
> them. Please check network connectivity (including firewalls
>
> and network routing requirements).
>
> --
>
> How do I get around this cleanly? This works just fine when I set 
> LD_LIBRARY_PATH in my .bashrc, but I’d rather not pollute that if I 
> can avoid it.
>
> -- 
>
> Gary Jackson, Ph.D.
>
> Johns Hopkins University Applied Physics Laboratory
>

[OMPI users] Setting LD_LIBRARY_PATH for orted

2017-08-21 Thread Jackson, Gary L.

I’m using a binary distribution of OpenMPI 1.10.2. As linked, it requires 
certain shared libraries outside of OpenMPI for orted itself to start. So, 
passing in LD_LIBRARY_PATH with the “-x” flag to mpirun doesn’t do anything:

$ mpirun --hostfile ${HOSTFILE} -N 1 -n 2 -x LD_LIBRARY_PATH hostname
/path/to/orted: error while loading shared libraries: LIBRARY.so: cannot open 
shared object file: No such file or directory
--
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--

How do I get around this cleanly? This works just fine when I set 
LD_LIBRARY_PATH in my .bashrc, but I’d rather not pollute that if I can avoid 
it.

--
Gary Jackson, Ph.D.
Johns Hopkins University Applied Physics Laboratory


Re: [OMPI users] Poor performance on Amazon EC2 with TCP

2016-03-13 Thread Jackson, Gary L.
By doing a parameter sweep, the best results I've gotten are with:

btl_tcp_eager_limit & btl_tcp_rndv_eager_limit = 2 ** 17
btl_tcp_sndbuf & btl_tcp_rcvbuf = 2 ** 24
btl_tcp_endpoint_cache = 2 ** 12
btl_tcp_tcp_links = 2

Even so, peak performance is around 7000 Mbit/s, reached at a message size of
around a megabyte.
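
Spelled out as a command line (powers of two expanded; the host file and benchmark 
binary are placeholders, and these values are specific to this EC2 setup):

mpirun --mca btl_tcp_eager_limit 131072 \
       --mca btl_tcp_rndv_eager_limit 131072 \
       --mca btl_tcp_sndbuf 16777216 \
       --mca btl_tcp_rcvbuf 16777216 \
       --mca btl_tcp_endpoint_cache 4096 \
       --mca btl_tcp_links 2 \
       -hostfile hosts -n 2 ./NPmpi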

For what it's worth, you can get access to AWS resources to do your own
tuning, which may be more expedient than working through me as a proxy.
Right now I'm using two c4.8xlarge instances in a placement group at
$1.675/hour each to work this out. I only keep the instances around while
I'm using them and terminate them when I'm done.

-- 
Gary Jackson



From:  users <users-boun...@open-mpi.org> on behalf of George Bosilca
<bosi...@icl.utk.edu>
Reply-To:  Open MPI Users <us...@open-mpi.org>
Date:  Friday, March 11, 2016 at 11:19 AM
To:  Open MPI Users <us...@open-mpi.org>
Subject:  Re: [OMPI users] Poor performance on Amazon EC2 with TCP


Gary,

The current fine-tuning of our TCP layer was done on a 1Gb network, which
might explain the performance degradation you see. There is a relationship
between the depth of the pipeline and the length of the packets that,
together with another set of MCA parameters, can have a drastic impact on
performance.

You should start with "ompi_info --param btl tcp -l 9".

From your performance graphs I can see that Intel MPI has an eager size of
around 128k (while ours is at 32k). Try to address this by setting
btl_tcp_eager_limit to 128k and also btl_tcp_rndv_eager_limit to the same
value.
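
For example (128k written out in bytes; the rest of the command line is whatever
you normally run):

mpirun --mca btl_tcp_eager_limit 131072 --mca btl_tcp_rndv_eager_limit 131072 ...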

By default Open MPI assumes TCP kernel buffers of 128k. These values can
be tuned at the kernel level
(http://www.cyberciti.biz/faq/linux-tcp-tuning/) and/or you can let Open
 MPI know that it can use more (by setting the MCA parameters
btl_tcp_sndbuf and btl_tcp_rcvbuf).
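
A sketch of both sides (the sysctl values are illustrative 16 MB ceilings in the
spirit of the linked guide and need root; the MCA values are capped by the kernel
limits):

# kernel side: raise the socket buffer limits
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

# Open MPI side: tell the TCP BTL it may request the larger buffers
mpirun --mca btl_tcp_sndbuf 16777216 --mca btl_tcp_rcvbuf 16777216 ./my_app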

Then you can play with the size of the TCP endpoint caching (it should be
set to a value where the memcpy is about the same cost as a syscall).
btl_tcp_endpoint_cache is the MCA parameter you are looking for.

Another trick: if the injection rate of a single fd is too slow, you can
ask Open MPI to use multiple channels by setting btl_tcp_links to
something other than 1. On a PS4 I had to bump it up to 3-4 to get the best
performance.

Other parameters to be tuned:
- btl_tcp_max_send_size
- btl_tcp_rdma_pipeline_send_length
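
To see the current defaults for these (and the rest of the TCP parameters), the
ompi_info command above can be filtered, e.g.:

ompi_info --param btl tcp -l 9 | grep -E 'max_send_size|rdma_pipeline_send_length'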

I don't have access to a 10Gb network to tune on. If you manage to tune it, I
would like to get the values for the different MCA parameters so that our
TCP BTL behaves optimally by default.

  Thanks,
George.



On Mar 10, 2016, at 11:45 , Jackson, Gary L. <gary.jack...@jhuapl.edu>
wrote:

I re-ran all experiments with 1.10.2 configured the way you specified. My
results are here:

https://www.dropbox.com/s/4v4jaxe8sflgymj/collected.pdf?dl=0

Some remarks:

1. OpenMPI had poor performance relative to raw TCP and IMPI across all
MTUs.
2. Those issues appeared at larger message sizes.
3. Intel MPI and raw TCP were comparable across message sizes and MTUs.

With respect to some other concerns:

1. I verified that the MTU values I'm using are correct with tracepath.
2. I am using a placement group.

-- 
Gary Jackson



From: users <users-boun...@open-mpi.org> on behalf of Gilles Gouaillardet
<gil...@rist.or.jp>
Reply-To: Open MPI Users <us...@open-mpi.org>
Date: Tuesday, March 8, 2016 at 11:07 PM
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Poor performance on Amazon EC2 with TCP


Jackson,

one more thing, how did you build openmpi?

if you built from git (and without VPATH), then --enable-debug is
automatically set, and this is hurting performance.
if not already done, i recommend you download the latest openmpi tarball
(1.10.2) and
./configure --with-platform=contrib/platform/optimized --prefix=...
last but not least, you can
mpirun --mca mpi_leave_pinned 1 
(that being said, i am not sure this is useful with TCP networks ...)

Cheers,

Gilles



On 3/9/2016 11:34 AM, Rayson Ho wrote:


If you are using instance types that support SR-IOV (aka. "enhanced
networking" in AWS), then turn it on. We saw huge differences when SR-IOV
is enabled

http://blogs.scalablelogic.com/2013/12/enhanced-networking-in-aws-cloud.html

http://blogs.scalablelogic.com/2014/01/enhanced-networking-in-aws-cloud-part-2.html


Make sure you start your instances with a placement group -- otherwise,
the instances can be data centers apart!


And check that jumbo frames are enabled properly:

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/network_mtu.html

But still, it is interesting that Intel MPI is getting a 2X speedup with
the same setup! Can you post the raw numbers so that we can take a deeper
look??

Rayson

==
Open Grid Scheduler - The Official Open Source Grid Engine

Re: [OMPI users] Poor performance on Amazon EC2 with TCP

2016-03-10 Thread Jackson, Gary L.
I re-ran all experiments with 1.10.2 configured the way you specified. My 
results are here:

https://www.dropbox.com/s/4v4jaxe8sflgymj/collected.pdf?dl=0

Some remarks:

  1.  OpenMPI had poor performance relative to raw TCP and IMPI across all MTUs.
  2.  Those issues appeared at larger message sizes.
  3.  Intel MPI and raw TCP were comparable across message sizes and MTUs.

With respect to some other concerns:

  1.  I verified that the MTU values I'm using are correct with tracepath.
  2.  I am using a placement group.

--
Gary Jackson

From: users <users-boun...@open-mpi.org> on behalf of Gilles Gouaillardet 
<gil...@rist.or.jp>
Reply-To: Open MPI Users <us...@open-mpi.org>
Date: Tuesday, March 8, 2016 at 11:07 PM
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Poor performance on Amazon EC2 with TCP

Jackson,

one more thing, how did you build openmpi?

if you built from git (and without VPATH), then --enable-debug is automatically 
set, and this is hurting performance.
if not already done, i recommend you download the latest openmpi tarball 
(1.10.2) and
./configure --with-platform=contrib/platform/optimized --prefix=...
last but not least, you can
mpirun --mca mpi_leave_pinned 1 
(that being said, i am not sure this is useful with TCP networks ...)
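
A quick way to check whether a debug build slipped in (a sketch; assumes the
ompi_info from the same installation is first in your PATH):

ompi_info | grep -i debug
# a performance build should report "Internal debug support: no"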

Cheers,

Gilles



On 3/9/2016 11:34 AM, Rayson Ho wrote:
If you are using instance types that support SR-IOV (aka. "enhanced networking" 
in AWS), then turn it on. We saw huge differences when SR-IOV is enabled

http://blogs.scalablelogic.com/2013/12/enhanced-networking-in-aws-cloud.html
http://blogs.scalablelogic.com/2014/01/enhanced-networking-in-aws-cloud-part-2.html

Make sure you start your instances with a placement group -- otherwise, the 
instances can be data centers apart!

And check that jumbo frames are enabled properly:

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/network_mtu.html
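
A quick sketch of that check (the interface and peer host names are placeholders):

# interface MTU on each node (9001 means jumbo frames are on for EC2)
ip link show eth0 | grep -o 'mtu [0-9]*'

# path MTU between two instances
tracepath other-node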

But still, it is interesting that Intel MPI is getting a 2X speedup with the 
same setup! Can you post the raw numbers so that we can take a deeper look??

Rayson

==
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html




On Tue, Mar 8, 2016 at 9:08 AM, Jackson, Gary L. 
<gary.jack...@jhuapl.edu> wrote:

I've built OpenMPI 1.10.1 on Amazon EC2. Using NetPIPE, I'm seeing about half 
the performance for MPI over TCP as I do with raw TCP. Before I start digging 
in to this more deeply, does anyone know what might cause that?

For what it's worth, I see the same issues with MPICH, but I do not see it with 
Intel MPI.

--
Gary Jackson





Re: [OMPI users] Poor performance on Amazon EC2 with TCP

2016-03-08 Thread Jackson, Gary L.
Nope, just one ethernet interface:

$ ifconfig
eth0      Link encap:Ethernet  HWaddr 0E:47:0E:0B:59:27
          inet addr:xxx.xxx.xxx.xxx  Bcast:xxx.xxx.xxx.xxx  Mask:255.255.252.0
          inet6 addr: fe80::c47:eff:fe0b:5927/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9001  Metric:1
          RX packets:16962 errors:0 dropped:0 overruns:0 frame:0
          TX packets:11564 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:28613867 (27.2 MiB)  TX bytes:1092650 (1.0 MiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:68 errors:0 dropped:0 overruns:0 frame:0
          TX packets:68 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:6647 (6.4 KiB)  TX bytes:6647 (6.4 KiB)


-- 
Gary Jackson




From:  users <users-boun...@open-mpi.org> on behalf of Gilles Gouaillardet
<gilles.gouaillar...@gmail.com>
Reply-To:  Open MPI Users <us...@open-mpi.org>
Date:  Tuesday, March 8, 2016 at 9:39 AM
To:  Open MPI Users <us...@open-mpi.org>
Subject:  Re: [OMPI users] Poor performance on Amazon EC2 with TCP


Jackson,

how many Ethernet interfaces are there?
if there are several, can you try again with only one:
mpirun --mca btl_tcp_if_include eth0 ...

Cheers,

Gilles

On Tuesday, March 8, 2016, Jackson, Gary L. <gary.jack...@jhuapl.edu>
wrote:


I've built OpenMPI 1.10.1 on Amazon EC2. Using NetPIPE, I'm seeing about
half the performance for MPI over TCP as I do with raw TCP. Before I start
digging in to this more deeply, does anyone know what might cause that?

For what it's worth, I see the same issues with MPICH, but I do not see it
with Intel MPI.

-- 
Gary Jackson



[OMPI users] Poor performance on Amazon EC2 with TCP

2016-03-08 Thread Jackson, Gary L.

I've built OpenMPI 1.10.1 on Amazon EC2. Using NetPIPE, I'm seeing about half 
the performance for MPI over TCP as I do with raw TCP. Before I start digging 
in to this more deeply, does anyone know what might cause that?

For what it's worth, I see the same issues with MPICH, but I do not see it with 
Intel MPI.
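
For reference, a sketch of the two measurements being compared (assumes NetPIPE's 
NPtcp and NPmpi binaries are built and on the PATH; host names are placeholders):

# raw TCP: start the receiver on nodeB, then run the transmitter on nodeA
$ NPtcp                # on nodeB
$ NPtcp -h nodeB       # on nodeA

# MPI over TCP with the Open MPI build under test
$ mpirun -np 2 -host nodeA,nodeB NPmpi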

--
Gary Jackson