Re: [OMPI users] Setting LD_LIBRARY_PATH for orted
Yup. It looks like I’m stuck with .bashrc. Thank you all for the suggestions.

--
Gary Jackson, Ph.D.
Johns Hopkins University Applied Physics Laboratory

On 8/22/17, 1:07 PM, "users on behalf of r...@open-mpi.org" <users-boun...@lists.open-mpi.org on behalf of r...@open-mpi.org> wrote:

I’m afraid not - that only applies the variable to the application, not the daemons. Truly, your only real option is to put something in your .bashrc, since you cannot modify the configure.

Or, if you are running in a managed environment, you can ask to have your resource manager forward your environment to the allocated nodes.

> On Aug 22, 2017, at 9:10 AM, Bennet Fauber <ben...@umich.edu> wrote:
>
> Would
>
>     $ mpirun -x LD_LIBRARY_PATH ...
>
> work here? I gather from the man page for mpirun that it should
> export the currently set value of LD_LIBRARY_PATH to the remote
> nodes prior to executing the command there.
>
> -- bennet
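If .bashrc really is the only option, the addition can at least be kept self-contained. A minimal sketch, assuming the third-party libraries live in /opt/thirdparty/lib (a placeholder; substitute the actual directory):

    # In ~/.bashrc on every compute node. /opt/thirdparty/lib is a
    # hypothetical location for the libraries orted needs.
    if [ -d /opt/thirdparty/lib ]; then
        export LD_LIBRARY_PATH="/opt/thirdparty/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    fi

Note that it has to go in .bashrc rather than .bash_profile, since the ssh-launched daemon runs in a non-interactive shell that typically sources only the former.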
Re: [OMPI users] Setting LD_LIBRARY_PATH for orted
I’m using a build of OpenMPI provided by a third party.

--
Gary Jackson, Ph.D.
Johns Hopkins University Applied Physics Laboratory

On 8/21/17, 8:04 PM, "users on behalf of Gilles Gouaillardet" <users-boun...@lists.open-mpi.org on behalf of gil...@rist.or.jp> wrote:

Gary,

One option (as mentioned in the error message) is to configure Open MPI with --enable-orterun-prefix-by-default. This forces the build process to use rpath, so you do not have to set LD_LIBRARY_PATH. It is the easiest option, but it cannot be used if you plan to relocate the Open MPI installation directory.

Another option is to use a wrapper for orted:

    mpirun --mca orte_launch_agent /.../myorted ...

where myorted is a script that looks like:

    #!/bin/sh
    export LD_LIBRARY_PATH=...
    exec /.../bin/orted "$@"

You can make this setting system-wide by adding the following line to /.../etc/openmpi-mca-params.conf:

    orte_launch_agent = /.../myorted

Cheers,

Gilles

On 8/22/2017 1:06 AM, Jackson, Gary L. wrote:
>
> I’m using a binary distribution of OpenMPI 1.10.2. As linked, it
> requires certain shared libraries outside of OpenMPI for orted itself
> to start. So, passing in LD_LIBRARY_PATH with the “-x” flag to mpirun
> doesn’t do anything:
>
> $ mpirun --hostfile ${HOSTFILE} -N 1 -n 2 -x LD_LIBRARY_PATH hostname
>
> /path/to/orted: error while loading shared libraries: LIBRARY.so:
> cannot open shared object file: No such file or directory
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
>
> How do I get around this cleanly? This works just fine when I set
> LD_LIBRARY_PATH in my .bashrc, but I’d rather not pollute that if I
> can avoid it.
>
> --
> Gary Jackson, Ph.D.
> Johns Hopkins University Applied Physics Laboratory
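To make the wrapper approach concrete, here is a minimal sketch with the elided paths filled in by hypothetical values (/opt/thirdparty/lib and /opt/openmpi are placeholders for the actual library and Open MPI install locations):

    #!/bin/sh
    # myorted: a launch-agent wrapper that sets the library path
    # before exec'ing the real orted. Install it at the same path
    # on every node and mark it executable (chmod +x myorted).
    export LD_LIBRARY_PATH=/opt/thirdparty/lib:$LD_LIBRARY_PATH
    exec /opt/openmpi/bin/orted "$@"

It can then be used either per-invocation:

    mpirun --mca orte_launch_agent /opt/openmpi/bin/myorted ...

or system-wide via the openmpi-mca-params.conf line Gilles shows above.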
[OMPI users] Setting LD_LIBRARY_PATH for orted
I’m using a binary distribution of OpenMPI 1.10.2. As linked, it requires certain shared libraries outside of OpenMPI for orted itself to start. So, passing in LD_LIBRARY_PATH with the “-x” flag to mpirun doesn’t do anything:

$ mpirun --hostfile ${HOSTFILE} -N 1 -n 2 -x LD_LIBRARY_PATH hostname

/path/to/orted: error while loading shared libraries: LIBRARY.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

How do I get around this cleanly? This works just fine when I set LD_LIBRARY_PATH in my .bashrc, but I’d rather not pollute that if I can avoid it.

--
Gary Jackson, Ph.D.
Johns Hopkins University Applied Physics Laboratory
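A quick way to see which libraries orted fails to resolve on a remote node is to run ldd over ssh; a sketch, assuming a node named node1 (hypothetical) and the orted path from the error above:

    # Lines printed as "not found" are the libraries that must be
    # put on LD_LIBRARY_PATH (or linked in via rpath).
    ssh node1 ldd /path/to/orted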
Re: [OMPI users] Poor performance on Amazon EC2 with TCP
By doing a parameter sweep, the best results I've gotten are with:

    btl_tcp_eager_limit & btl_tcp_rndv_eager_limit = 2 ** 17
    btl_tcp_sndbuf & btl_tcp_rcvbuf = 2 ** 24
    btl_tcp_endpoint_cache = 2 ** 12
    btl_tcp_links = 2

Even so, peak performance is around 7000 Mbit/s, at a message size of around a megabyte.

For what it's worth, you can get access to AWS resources to do your own tuning, which may be more expedient than working through me as a proxy. Right now I'm using two c4.8xlarge instances in a placement group at $1.675/hour each to work this out. I only keep them around for as long as I'm using them, then I terminate the instances when I'm done working.

--
Gary Jackson

From: users <users-boun...@open-mpi.org> on behalf of George Bosilca <bosi...@icl.utk.edu>
Reply-To: Open MPI Users <us...@open-mpi.org>
Date: Friday, March 11, 2016 at 11:19 AM
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Poor performance on Amazon EC2 with TCP

Gary,

The current fine tuning of our TCP layer was done on a 1Gb network, and might result in the performance degradation you see. There is a relationship between the depth of the pipeline and the length of the packets, together with another set of MCA parameters, that can have a drastic impact on performance. You should start with "ompi_info --param btl tcp -l 9".

From your performance graphs I can see that Intel MPI has an eager size of around 128k (while ours is at 32k). Try to address this by setting btl_tcp_eager_limit to 128k, and also btl_tcp_rndv_eager_limit to the same value.

By default Open MPI assumes TCP kernel buffers of 128k. These values can be tuned at the kernel level (http://www.cyberciti.biz/faq/linux-tcp-tuning/) and/or you can let Open MPI know that it can use more (by setting the MCA parameters btl_tcp_sndbuf and btl_tcp_rcvbuf).

Then you can play with the size of the TCP endpoint caching (it should be set to a value where the memcpy is about the same cost as a syscall). btl_tcp_endpoint_cache is the MCA parameter you are looking for.

Another trick: in case the injection rate of a single fd is too slow, you can ask Open MPI to use multiple channels by setting btl_tcp_links to something other than 1. On a PS4 I had to bump it up to 3-4 to get the best performance.

Other parameters to be tuned:
- btl_tcp_max_send_size
- btl_tcp_rdma_pipeline_send_length

I don't have access to a 10Gb network to tune. If you manage to tune it, I would like to get the values for the different MCA parameters so that our TCP BTL behaves optimally by default.

Thanks,
George.

On Mar 10, 2016, at 11:45, Jackson, Gary L. <gary.jack...@jhuapl.edu> wrote:

I re-ran all experiments with 1.10.2 configured the way you specified. My results are here: https://www.dropbox.com/s/4v4jaxe8sflgymj/collected.pdf?dl=0

Some remarks:

1. OpenMPI had poor performance relative to raw TCP and IMPI across all MTUs.
2. Those issues appeared at larger message sizes.
3. Intel MPI and raw TCP were comparable across message sizes and MTUs.

With respect to some other concerns:

1. I verified that the MTU values I'm using are correct with tracepath.
2. I am using a placement group.
--
Gary Jackson

From: users <users-boun...@open-mpi.org> on behalf of Gilles Gouaillardet <gil...@rist.or.jp>
Reply-To: Open MPI Users <us...@open-mpi.org>
Date: Tuesday, March 8, 2016 at 11:07 PM
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Poor performance on Amazon EC2 with TCP

Jackson,

One more thing: how did you build Open MPI? If you built from git (and without VPATH), then --enable-debug is automatically set, and this is hurting performance.

If not already done, I recommend you download the latest Open MPI tarball (1.10.2) and

    ./configure --with-platform=contrib/platform/optimized --prefix=...

Last but not least, you can

    mpirun --mca mpi_leave_pinned 1 ...

(that being said, I am not sure this is useful with TCP networks).

Cheers,

Gilles

On 3/9/2016 11:34 AM, Rayson Ho wrote:

If you are using instance types that support SR-IOV (aka. "enhanced networking" in AWS), then turn it on. We saw huge differences when SR-IOV is enabled:

http://blogs.scalablelogic.com/2013/12/enhanced-networking-in-aws-cloud.html
http://blogs.scalablelogic.com/2014/01/enhanced-networking-in-aws-cloud-part-2.html

Make sure you start your instances in a placement group -- otherwise, the instances can be data centers apart! And check that jumbo frames are enabled properly:

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/network_mtu.html

But still, it is interesting that Intel MPI is getting a 2X speedup with the same setup! Can you post the raw numbers so that we can take a deeper look?

Rayson

==
Open Grid Scheduler - The Official Open Source Grid Engine
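Expressed as a command line, Gary's swept values above would look roughly like this (a sketch; the hostfile and the NPmpi benchmark binary are placeholders, and the limits are the 2**17, 2**24, and 2**12 values written out in bytes):

    mpirun --hostfile hosts -N 1 -n 2 \
           --mca btl_tcp_eager_limit 131072 \
           --mca btl_tcp_rndv_eager_limit 131072 \
           --mca btl_tcp_sndbuf 16777216 \
           --mca btl_tcp_rcvbuf 16777216 \
           --mca btl_tcp_endpoint_cache 4096 \
           --mca btl_tcp_links 2 \
           ./NPmpi

And a few quick sanity checks along the lines Rayson suggests, using standard Linux tools (eth0 and other-node are assumptions for the instance's interface name and the peer host):

    # Enhanced networking / SR-IOV: the driver should be ixgbevf
    # (or ena on newer instance types) rather than the Xen vif driver.
    ethtool -i eth0

    # Jumbo frames: the interface MTU should report 9001.
    ip link show eth0

    # Confirm the jumbo MTU actually holds along the path to the peer.
    tracepath other-node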
Re: [OMPI users] Poor performance on Amazon EC2 with TCP
I re-ran all experiments with 1.10.2 configured the way you specified. My results are here:

https://www.dropbox.com/s/4v4jaxe8sflgymj/collected.pdf?dl=0

Some remarks:

1. OpenMPI had poor performance relative to raw TCP and IMPI across all MTUs.
2. Those issues appeared at larger message sizes.
3. Intel MPI and raw TCP were comparable across message sizes and MTUs.

With respect to some other concerns:

1. I verified that the MTU values I'm using are correct with tracepath.
2. I am using a placement group.

--
Gary Jackson

From: users <users-boun...@open-mpi.org> on behalf of Gilles Gouaillardet <gil...@rist.or.jp>
Reply-To: Open MPI Users <us...@open-mpi.org>
Date: Tuesday, March 8, 2016 at 11:07 PM
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Poor performance on Amazon EC2 with TCP

Jackson,

One more thing: how did you build Open MPI? If you built from git (and without VPATH), then --enable-debug is automatically set, and this is hurting performance.

If not already done, I recommend you download the latest Open MPI tarball (1.10.2) and

    ./configure --with-platform=contrib/platform/optimized --prefix=...

Last but not least, you can

    mpirun --mca mpi_leave_pinned 1 ...

(that being said, I am not sure this is useful with TCP networks).

Cheers,

Gilles

On 3/9/2016 11:34 AM, Rayson Ho wrote:

If you are using instance types that support SR-IOV (aka. "enhanced networking" in AWS), then turn it on. We saw huge differences when SR-IOV is enabled:

http://blogs.scalablelogic.com/2013/12/enhanced-networking-in-aws-cloud.html
http://blogs.scalablelogic.com/2014/01/enhanced-networking-in-aws-cloud-part-2.html

Make sure you start your instances in a placement group -- otherwise, the instances can be data centers apart! And check that jumbo frames are enabled properly:

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/network_mtu.html

But still, it is interesting that Intel MPI is getting a 2X speedup with the same setup! Can you post the raw numbers so that we can take a deeper look?

Rayson

==
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html

On Tue, Mar 8, 2016 at 9:08 AM, Jackson, Gary L. <gary.jack...@jhuapl.edu> wrote:

I've built OpenMPI 1.10.1 on Amazon EC2. Using NetPIPE, I'm seeing about half the performance for MPI over TCP as I do with raw TCP. Before I start digging into this more deeply, does anyone know what might cause that?

For what it's worth, I see the same issues with MPICH, but I do not see it with Intel MPI.

--
Gary Jackson
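One cheap sanity check after rebuilding with the optimized platform file is to confirm the install is not a debug build; a sketch (the exact wording of ompi_info's output may vary between versions):

    # A non-debug build should report "Internal debug support: no".
    ompi_info | grep -i debug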
Re: [OMPI users] Poor performance on Amazon EC2 with TCP
Nope, just one Ethernet interface:

$ ifconfig
eth0      Link encap:Ethernet  HWaddr 0E:47:0E:0B:59:27
          inet addr:xxx.xxx.xxx.xxx  Bcast:xxx.xxx.xxx.xxx  Mask:255.255.252.0
          inet6 addr: fe80::c47:eff:fe0b:5927/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9001  Metric:1
          RX packets:16962 errors:0 dropped:0 overruns:0 frame:0
          TX packets:11564 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:28613867 (27.2 MiB)  TX bytes:1092650 (1.0 MiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:68 errors:0 dropped:0 overruns:0 frame:0
          TX packets:68 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:6647 (6.4 KiB)  TX bytes:6647 (6.4 KiB)

--
Gary Jackson

From: users <users-boun...@open-mpi.org> on behalf of Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
Reply-To: Open MPI Users <us...@open-mpi.org>
Date: Tuesday, March 8, 2016 at 9:39 AM
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Poor performance on Amazon EC2 with TCP

Jackson,

How many Ethernet interfaces are there? If there are several, can you try again with only one:

    mpirun --mca btl_tcp_if_include eth0 ...

Cheers,

Gilles

On Tuesday, March 8, 2016, Jackson, Gary L. <gary.jack...@jhuapl.edu> wrote:

I've built OpenMPI 1.10.1 on Amazon EC2. Using NetPIPE, I'm seeing about half the performance for MPI over TCP as I do with raw TCP. Before I start digging into this more deeply, does anyone know what might cause that?

For what it's worth, I see the same issues with MPICH, but I do not see it with Intel MPI.

--
Gary Jackson
[OMPI users] Poor performance on Amazon EC2 with TCP
I've built OpenMPI 1.10.1 on Amazon EC2. Using NetPIPE, I'm seeing about half the performance for MPI over TCP as I do with raw TCP. Before I start digging into this more deeply, does anyone know what might cause that?

For what it's worth, I see the same issues with MPICH, but I do not see it with Intel MPI.

--
Gary Jackson
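For anyone reproducing the comparison, the two NetPIPE measurements would look roughly like the following sketch (NPtcp and NPmpi are the NetPIPE TCP and MPI binaries; node1, node2, and the hostfile are placeholders):

    # Raw TCP baseline: start the receiver first, then the transmitter.
    node2$ ./NPtcp
    node1$ ./NPtcp -h node2

    # The same message-size sweep over MPI, one rank on each node.
    mpirun --hostfile hosts -N 1 -n 2 ./NPmpi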