Re: [OMPI users] question about the Open-MPI ABI

2023-02-01 Thread Barrett, Brian via users
Because we’ve screwed up in the past?  I think the ompi_message_null was me, 
and I was in a hurry to prototype for the MPI Forum.  And then it stuck.

Brian

On 2/1/23, 3:16 AM, "users on behalf of Jeff Hammond via users" 
<users@lists.open-mpi.org> wrote:


Why do the null handles not follow a consistent scheme, at least in Open-MPI 
4.1.2?

ompi_mpi_<handle>_null is used except when handle={request,message}, which drop 
the "mpi_".

The above have an associated ..._null_addr except ompi_mpi_datatype_null and 
ompi_message_null.

Why?

Jeff

Open MPI v4.1.2, package: Debian OpenMPI, ident: 4.1.2, repo rev: v4.1.2, Nov 
24, 2021

$ nm -gD /usr/lib/x86_64-linux-gnu/libmpi.so | grep ompi | grep null | grep -v 
fn
00134040 B ompi_message_null
00126300 B ompi_mpi_comm_null
00115898 D ompi_mpi_comm_null_addr
00120f40 D ompi_mpi_datatype_null
0012cb00 B ompi_mpi_errhandler_null
00116030 D ompi_mpi_errhandler_null_addr
00134660 B ompi_mpi_file_null
001163e8 D ompi_mpi_file_null_addr
00126200 B ompi_mpi_group_null
00115890 D ompi_mpi_group_null_addr
0012cf80 B ompi_mpi_info_null
00116038 D ompi_mpi_info_null_addr
00133720 B ompi_mpi_op_null
001163c0 D ompi_mpi_op_null_addr
00135740 B ompi_mpi_win_null
00117c80 D ompi_mpi_win_null_addr
0012d080 B ompi_request_null
00116040 D ompi_request_null_addr
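
For readers puzzling over what these symbols are: roughly speaking, the MPI 
predefined handles are pointers to extern global objects such as the ones 
listed above.  A simplified sketch (illustrative only, not the actual Open MPI 
headers) of how that looks in C:

    /* Simplified sketch -- not the actual Open MPI headers. */
    struct ompi_communicator_t;
    typedef struct ompi_communicator_t *MPI_Comm;

    /* The object itself (the B/D symbols in the nm output above)... */
    extern struct ompi_communicator_t ompi_mpi_comm_null;
    /* ...and, for most handle types, a companion symbol that (judging from
       the names) holds the object's address. */
    extern struct ompi_communicator_t *ompi_mpi_comm_null_addr;

    /* The predefined handle the application sees is the object's address. */
    #define MPI_COMM_NULL ((MPI_Comm) &ompi_mpi_comm_null)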


--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/


Re: [OMPI users] Beginner Troubleshooting OpenMPI Installation - pmi.h Error

2022-10-04 Thread Barrett, Brian via users
Can you include the configure command you used for Open MPI, as well as the 
output of "make all V=1"?  (It's OK if that's from a tree you've already tried 
to build; the full compile command for the file that is failing to compile is 
the part of interest.)

Thanks,

Brian

On 10/4/22, 9:06 AM, "users on behalf of Jeffrey D. (JD) Tamucci via users" 
<users@lists.open-mpi.org> wrote:




Hi,

I have been trying to install OpenMPI v4.1.4 on a university HPC cluster. We 
use the Bright cluster manager and have SLURM v21.08.8 and RHEL 8.6. I used a 
script to install OpenMPI that a former co-worker had used to successfully 
install OpenMPI v3.0.0 previously. I updated it to include new versions of the 
dependencies and new paths to those installs.

Each time, it fails in the make install step. There is a fatal error about 
finding pmi.h. It specifically says:

make[2]: Entering directory '/shared/maylab/src/openmpi-4.1.4/opal/mca/pmix/s1'
  CC   libmca_pmix_s1_la-pmix_s1_component.lo
  CC   libmca_pmix_s1_la-pmix_s1.lo
pmix_s1.c:29:10: fatal error: pmi.h: No such file or directory
   29 | #include <pmi.h>

I've looked through the archives and seen others face similar errors in years 
past but I couldn't understand the solutions. One person suggested that SLURM 
may be missing PMI libraries. I think I've verified that SLURM has PMI. I 
include paths to those files and it seems to find them earlier in the process.

I'm not sure what the next step is in troubleshooting this. I have included a 
bz2 file containing my install script, a log file containing the script output 
(from build, make, make install), the config.log, and the opal_config.h file. 
If anyone could provide any guidance, I'd  sincerely appreciate it.

Best,
JD


Re: [OMPI users] MPI_THREAD_MULTIPLE question

2022-09-14 Thread Barrett, Brian via users
Yes, this is the case for Open MPI 4.x and earlier, due to various bugs.  When 
Open MPI 5.0 ships, we will resolve this issue.
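
As a general aside (and unrelated to the specific bugs above), a minimal sketch 
of how an application requests MPI_THREAD_MULTIPLE and verifies what level the 
library actually granted -- hypothetical example, not code from this thread:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            /* The library granted a lower level; concurrent MPI calls from
               multiple threads are not safe in this build/configuration. */
            printf("requested MPI_THREAD_MULTIPLE, got level %d\n", provided);
        }
        MPI_Finalize();
        return 0;
    }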

Brian

On 9/9/22, 9:58 PM, "users on behalf of mrlong336 via users" 
<users@lists.open-mpi.org> wrote:




mpirun reports the following error:
The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this release.
Workarounds are to run on a single node, or to use a system with an RDMA
capable network such as Infiniband.

Does this error mean that the network must support RDMA if it wants to run 
distributed? Will Gigabit/10 Gigabit Ethernet work?



Best regards,

Timesir

mrlong...@gmail.com











Re: [OMPI users] How to set parameters to utilize multiple network interfaces?

2021-06-11 Thread Barrett, Brian via users
John -

Open MPI's OFI implementation does not stripe a single process's messages 
across multiple NICs.  Instead, an Open MPI process will choose the "closest" 
NIC on the system (based on PCI hops and PCI topology, using hwloc).  If there 
is more than one "closest" NIC, as is the case on P4, where each Intel socket 
has two PCI switches, each with 2 GPUs and an EFA NIC behind them, then the 
processes will round-robin between the N closest NICs.  This isn't perfect, 
and the algorithm can get the wrong answer in some situations, but on P4 it 
should generally get the right answer.  The reason for this implementation is 
that Open MPI uses OFI's tag-matching interface, and striping messages across 
multiple tag-matching interfaces is rather complicated.  An OFI provider could 
choose to stripe messages across devices internally, of course, but we believe 
that, given the topologies involved and the limited cross-PCI-switch bandwidth 
available on platforms like P4, round-robin assignment is more beneficial to 
application performance.

This is why you see differences between osu_bw and osu_mbw_mr - one is showing 
you single-process-pair data (meaning you are maxing out the single NIC that 
was chosen) and the other involves multiple processes per instance and is 
therefore using all the NICs.

I think Ralph covered the process mapping / hostfile discussion, so I have 
nothing to add there, other than to point out that all the process / NIC 
mapping happens independently of the hostfile.  The big potential gotcha is 
that the NIC selection algorithm will only give reliable results if you pin 
processes to the socket or smaller.  Pinning to the socket is the default 
behavior for Open MPI, so that should not be a problem.  But if you start 
changing the process pinning behavior, please do remember that you can have an 
impact on how Open MPI does NIC selection.  With the multiple PCI switches and 
some of the behaviors of the Intel root complex, you really don't want to be 
driving traffic from one socket to an EFA device attached to another socket on 
the P4 platform.

Hope this helps,

Brian

On 6/11/21, 6:43 AM, "users on behalf of Ralph Castain via users" 
 wrote:




You can still use "map-by" to get what you want since you know there are 
four interfaces per node - just do "--map-by ppr:8:node". Note that you 
definitely do NOT want to list those multiple IP addresses in your hostfile - 
all you are doing is causing extra work for mpirun as it has to DNS resolve 
those addresses back down to their common host. We then totally ignore the fact 
that you specified those addresses, so it is accomplishing nothing (other than 
creating extra work).

You'll need to talk to AWS about how to drive striping across the 
interfaces. It sounds like they are automatically doing it, but perhaps not 
according to the algorithm you are seeking (i.e., they may not make such a 
linear assignment as you describe).


> On Jun 8, 2021, at 1:23 PM, John Moore via users 
 wrote:
>
> Hello,
>
> I am trying to run OpenMPI on AWS's new p4d instances. These instances 
have 4x 100Gb/s network interfaces, each with their own ipv4 address.
>
> I am primarily testing the bandwidth with the osu_micro_benchmarks test 
suite. Specifically I am running the osu_bibw and osu_mbw_mr tests to calculate 
the peak aggregate bandwidth I can achieve between two instances.
>
> I have found that running the osu_bibw test only achieves the throughput of 
one network interface (100 Gb/s).  This is the command I am using:
> /opt/amazon/openmpi/bin/mpirun -v -x FI_EFA_USE_DEVICE_RDMA=1 -x 
FI_PROVIDER="efa" -np 2 -host host1,host2 --map-by node --mca btl_base_verbose 
30 --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0  ./osu_bw -m 4000
>
> As far as I understand it, openmpi should be detecting the four 
interfaces and striping data across them, correct?
>
> I have found that the osu_mbw_mr test can achieve 4x the bandwidth of a 
single network interface, if the configuration is correct. For example, I am 
using the following command:
> /opt/amazon/openmpi/bin/mpirun -v -x FI_EFA_USE_DEVICE_RDMA=1 -x 
FI_PROVIDER="efa" -np 8 -hostfile hostfile5 --map-by node --mca 
btl_base_verbose 30 --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0  
./osu_mbw_mr
> This will run four pairs of send/recv calls across the different nodes. 
hostfile5 contains all 8 local ipv4 addresses associated with the four nodes. I 
believe this is why I am getting the expected performance.
>
> So, now I want to run a real use case, but I can't use --map-by node. I 
want to run two ranks per ipv4 address (interface) with the ranks ordered 
sequentially according to the hostfile (the first 8 ranks will 

[OMPI users] Open MPI release update

2020-06-15 Thread Barrett, Brian via users
Greetings -

As you may know, Open MPI 5.0 is going to include an ambitious improvement in 
Open MPI's runtime system along with a number of performance improvements, and 
was targeted to ship this summer.  While we are still going to make those 
improvements to our runtime system, it is taking us longer than we anticipated 
and we want to make sure we get everything right before releasing Open MPI 5.0 
later this year.

To get many of the performance improvements to the MPI layer in your hands 
sooner, the Open MPI development team recently made the decision to start a 4.1 
release series.  Open MPI 4.1.0 will be based off of Open MPI 4.0.4, but with a 
number of significant performance improvements backported.  This includes 
improved collective routine performance, the OFI BTL to support one-sided 
operations over the Libfabric software stack, MPI I/O improvements, and general 
performance improvements.  Nightly tarballs of the 4.1 branch are available at 
https://www.open-mpi.org/nightly/v4.1.x/ and we plan on releasing Open MPI 
4.1.0 in July of 2020.

Thank you,

The Open MPI Development Team



Re: [OMPI users] disabling ucx over omnipath

2019-11-15 Thread Barrett, Brian via users
What you're asking for is an ugly path of interconnected dependencies between 
products owned by different companies.  It also completely blows any object 
model we can think of out of the water.  It's all bad in the general case.  The 
best we've come up with for the Libfabric MTL is to disable itself if it only 
finds a blacklisted provider, and then blacklist the VERBS and TCP providers.  
UCX could maybe do this, but it's hard when we're talking VERBS.

I think you actually found the right solution for your network.  You know what 
you want to have happen, so set the priorities accordingly.  A default params 
config file bumping up the PSM2 MTL priority sounds like exactly what I would 
suggest as a workaround.

Brian

-Original Message-
From: users  on behalf of Brice Goglin via 
users 
Reply-To: Open MPI Users 
Date: Friday, November 15, 2019 at 1:55 AM
To: "users@lists.open-mpi.org" 
Cc: Brice Goglin , Ludovic Courtès 

Subject: [OMPI users] disabling ucx over omnipath

Hello

We have a platform with an old MLX4 partition and another OPA partition.
We want a single OMPI installation working for both kinds of nodes. When
we enable UCX in OMPI for MLX4, UCX ends up being used on the OPA
partition too, and the performance is poor (3GB/s instead of 10). The
problem seems to be that UCX gets enabled because they added support for
OPA in UCX 1.6, even though that's just poor OPA support through Verbs.

The only solution we found is to bump the mtl_psm2_priority to 52 so
that PSM2 gets used before PML UCX. Seems to work fine but I am not sure
it's a good idea. Could OMPI rather tell UCX to disable itself when it
only finds OPA?

Thanks

Brice





Re: [OMPI users] Limit to number of asynchronous sends/receives?

2018-12-17 Thread Barrett, Brian via users
Adam -

There are a couple of theoretical limits on how many requests you can have 
outstanding (at some point, you will run the host out of memory).  However, 
those issues should be a problem when posting the MPI_Isend or MPI_Irecv, not 
during MPI_Waitall.  2.1.0 is pretty old; the first step in further debugging 
is to upgrade to one of the recent releases (3.1.3 or 4.0.0) and verify that 
the bug still exists.
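
For reference, the pattern under discussion boils down to something like the 
sketch below (hypothetical, not the poster's actual code); the point is that 
the request array is bounded only by available memory, so a crash in 
MPI_Waitall usually indicates a bug rather than a hard limit:

    #include <mpi.h>
    #include <stdlib.h>

    /* Post many nonblocking transfers, then complete them all at once. */
    void exchange_chunks(void **sbuf, int *scount, int *sdest, int nsends,
                         void **rbuf, int *rcount, int *rsrc,  int nrecvs)
    {
        int total = nsends + nrecvs;   /* ~450 on the heavily loaded ranks */
        MPI_Request *reqs = malloc(total * sizeof(MPI_Request));
        int r = 0;

        for (int i = 0; i < nrecvs; i++)
            MPI_Irecv(rbuf[i], rcount[i], MPI_BYTE, rsrc[i], 0,
                      MPI_COMM_WORLD, &reqs[r++]);
        for (int i = 0; i < nsends; i++)
            MPI_Isend(sbuf[i], scount[i], MPI_BYTE, sdest[i], 0,
                      MPI_COMM_WORLD, &reqs[r++]);

        MPI_Waitall(total, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
    }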

Brian

> On Dec 16, 2018, at 6:52 AM, Adam Sylvester  wrote:
> 
> I'm running OpenMPI 2.1.0 on RHEL 7 using TCP communication.  For the 
> specific run that's crashing on me, I'm running with 17 ranks (on 17 
> different physical machines).  I've got a stage in my application where ranks 
> need to transfer chunks of data where the size of each chunk is trivial (on 
> the order of 100 MB) compared to the overall imagery.  However, the chunks 
> are spread out across many buffers in a way that makes the indexing 
> complicated (and the memory is not all within a single buffer)... the 
> simplest way to express the data movement in code is by a large number of 
> MPI_Isend() and MPI_Irecv() calls followed of course by an eventual 
> MPI_Waitall().  This works fine for many cases, but I've run into a case now 
> where the chunks are imbalanced such that a few ranks have a total of ~450 
> MPI_Request objects (I do a single MPI_Waitall() with all requests at once) 
> and the remaining ranks have < 10 MPI_Requests.  In this scenario, I get a 
> seg fault inside PMPI_Waitall().
> 
> Is there an implementation limit as to how many asynchronous requests are 
> allowed?  Is there a way this can be queried either via a #define value or 
> runtime call?  I probably won't go this route, but when initially compiling 
> OpenMPI, is there a configure option to increase it?
> 
> I've done a fair amount of debugging and am pretty confident this is where 
> the error is occurring as opposed to indexing out of bounds somewhere, but if 
> there is no such limit in OpenMPI, that would be useful to know too.
> 
> Thanks.
> -Adam
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] RDMA over Ethernet in Open MPI - RoCE on AWS?

2018-09-27 Thread Barrett, Brian via users
On Sep 11, 2018, at 10:46 AM, Benjamin Brock <br...@cs.berkeley.edu> wrote:

Thanks for your response.

One question: why would RoCE still require host processing of every packet?  I 
thought the point was that some nice server Ethernet NICs can handle RDMA 
requests directly?  Or am I misunderstanding RoCE / how Open MPI's RoCE 
transport works?

Sorry, I missed your follow-up question.

There’s nothing that says that RoCE *must* be implemented in the NIC.  It is 
entirely possible to write a host-side kernel driver to implement the RoCE 
protocol.  My point was that if you were to do this, you wouldn’t have any of 
the benefits that people expect with RoCE, but the protocol would work just 
fine.  Similar to how you can write a VERBS implementation over DPDK and run 
the entire protocol in user space (https://github.com/zrlio/urdma).  While I 
haven’t tested the urdma package, both the Intel 82599 and ENA support DPDK, so 
if what you’re looking for is a VERBS stack, that might be one option.  
Personally, I’d just use Open MPI over TCP if Open MPI is your goal, because 
that sounds like a lot of headaches in configuration.

Brian
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] How do I build 3.1.0 (or later) with mellanox's libraries

2018-09-19 Thread Barrett, Brian via users
Yeah, there’s no good answer here from an “automatically do the right thing” 
point of view.  The reachable:netlink component (which is used for the TCP BTL) 
only works with libnl-3 because libnl-1 is a real pain to deal with if you’re 
trying to parse route behaviors.  It will do the right thing if you’re using 
OpenIB (the other place the libnl-1/libnl-3 thing comes into play) because 
OpenIB runs its configure test before reachable:netlink, but UCX’s tests run 
way later (for reasons that aren’t fixable).

Mellanox should really update everything to use libnl3 so that there’s at least 
hope of getting the right answer (not just in Open MPI, but in general; libnl-1 
is old and not awesome).  In the mean time, I *think* you can work around this 
problem via two paths.  First, which I know will work, is to remove the libnl-3 
devel package.  That’s probably not optimal for obvious reasons.  The second is 
to specify --enable-mca-no-build=reachable-netlink, which will disable the 
component that is preferring libnl-3 and then UCX should be happy.

Hope this helps,

Brian

> On Sep 19, 2018, at 9:12 AM, Jeff Squyres (jsquyres) via users 
>  wrote:
> 
> Alan --
> 
> Sorry for the delay.
> 
> I agree with Gilles: Brian's commit had to do with "reachable" plugins in 
> Open MPI -- they do not appear to be the problem here.
> 
> From the config.log you sent, it looks like configure aborted because you 
> requested UCX support (via --with-ucx) but configure wasn't able to find it.  
> And it looks like it didn't find it because of libnl v1 vs. v3 issues, as you 
> stated.
> 
> I think we're going to have to refer you to Mellanox support on this one.  
> The libnl situation is kind of a nightmare: your entire stack must be 
> compiled for either libnl v1 *or* v3.  If you have both libnl v1 *and* v3 
> appear in a process together, the process will crash before main() even 
> executes.  :-(  This is precisely why we have these warnings in Open MPI's 
> configure.
> 
> 
> 
> 
>> On Sep 14, 2018, at 4:35 PM, Alan Wild  wrote:
>> 
>> As request I've attached the config.log.  I also included the output from 
>> configure itself.
>> 
>> -Alan
>> 
>> On Fri, Sep 14, 2018, 10:20 AM Alan Wild  wrote:
>> I apologize if this has been discussed before but I've been unable to find 
>> discussion on the topic.
>> 
>> I recently went to build 3.1.2 on our cluster only to have the build 
>> completely fail during configure due to issues with libnl versions.
>> 
>> Specifically I had requested support for Mellanox's libraries (mxm, 
>> hcoll, sharp, etc.), which was fine for me in 3.0.0 and 3.0.1.  However, it 
>> appears all of those libraries are built with libnl version 1, but the 
>> netlink component is now requiring libnl version 3 and aborts the build if 
>> it finds anything else in LIBS that uses version 1.
>> 
>> I don't believe Mellanox is providing releases of these libraries linked 
>> against libnl version 3 (love to find out I'm wrong on that), at least not 
>> for CentOS 6.9.
>> 
>> According to github, it appears bwbarret's commit a543e7f (from one year ago 
>> today) which was merged into 3.1.0 is responsible.  However I'm having a 
>> hard time believing that openmpi would want to break support for these 
>> libraries or there isn't some other kind of workaround.
>> 
>> I'm on a short timeline to deliver this build of openmpi to my users but I 
>> know they won't accept a build that doesn't support mellanox's libraries.
>> 
>> Hoping there's an easy fix (short of trying to reverse the commit in my 
>> build) that I'm overlooking.
>> 
>> Thanks,
>> 
>> -Alan
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] RDMA over Ethernet in Open MPI - RoCE on AWS?

2018-09-10 Thread Barrett, Brian via users
It sounds like what you’re asking is “how do I get the best performance from 
Open MPI in AWS?”.

The TCP BTL is your best option for performance in AWS.  RoCE is going to be a 
bunch of work to get setup, and you’ll still end up with host processing of 
every packet.  There are a couple simple instance tweaks that can make a big 
difference.  AWS has published a very nice guide for setting up an EDA workload 
environment [1], which has a number of useful tweaks, particularly if you’re 
using C4 or earlier compute instances.  The biggest improvement, however, is to 
make sure you’re using a version of Open MPI newer than 2.1.2.  We fixed some 
fairly serious performance issues in the Open MPI TCP stack (that, humorously 
enough, were also in the MPICH TCP stack and have been fixed there as well) in 
2.1.2.

Given that your application is fairly asynchronous, you might want to 
experiment with the btl_tcp_progress_thread MCA parameter.  If your application 
benefits from asynchronous progress, using a progress thread might be the best 
option.
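
As a rough illustration (hypothetical code, not from this thread) of the 
communication/computation overlap that asynchronous progress helps with:

    #include <mpi.h>

    static void do_local_work(void) { /* placeholder for application compute */ }

    void overlap_example(double *out, double *in, int n, int peer, MPI_Comm comm)
    {
        MPI_Request reqs[2];

        MPI_Irecv(in,  n, MPI_DOUBLE, peer, 0, comm, &reqs[0]);
        MPI_Isend(out, n, MPI_DOUBLE, peer, 0, comm, &reqs[1]);

        /* With a progress thread, the transfers can advance while this runs;
           without one, much of the data movement may wait until MPI_Waitall. */
        do_local_work();

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }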

Brian

> On Sep 6, 2018, at 7:10 PM, Benjamin Brock  wrote:
> 
> I'm setting up a cluster on AWS, which will have a 10Gb/s or 25Gb/s Ethernet 
> network.  Should I expect to be able to get RoCE to work in Open MPI on AWS?
> 
> More generally, what optimizations and performance tuning can I do to an Open 
> MPI installation to get good performance on an Ethernet network?
> 
> My codes use a lot of random access AMOs and asynchronous block transfers, so 
> it seems to me like setting up RDMA over Ethernet would be essential to 
> getting good performance, but I can't seem to find much information about it 
> online.
> 
> Any pointers you have would be appreciated.
> 
> Ben
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] Open MPI v3.1.0 Released

2018-05-07 Thread Barrett, Brian via users
The Open MPI Team, representing a consortium of research, academic, and 
industry partners, is pleased to announce the release of Open MPI version 
3.1.0. 

v3.1.0 is the start of a new release series for Open MPI.  New features include 
a monitoring framework to track data movement in MPI operations, support for 
MPI communicator assertions, and direct support for one-sided operations over 
UCX.  The embedded PMIx runtime has been updated to 2.1.1.  There have been 
numerous other bug fixes and performance improvements.  Version 3.1.0 can be 
downloaded from the main Open MPI web site:

  https://www.open-mpi.org/software/ompi/v3.1/

NEWS:

- Various OpenSHMEM bug fixes.
- Properly handle array_of_commands argument to Fortran version of
  MPI_COMM_SPAWN_MULTIPLE.
- Fix bug with MODE_SEQUENTIAL and the sharedfp MPI-IO component.
- Use "javac -h" instead of "javah" when building the Java bindings
  with a recent version of Java.
- Fix mis-handling of the job step ID under SLURM that could cause problems
  with PathScale/OmniPath NICs.
- Disable the POWER 7/BE block in configure.  Note that POWER 7/BE is
  still not a supported platform, but it is no longer automatically
  disabled.  See
  https://github.com/open-mpi/ompi/issues/4349#issuecomment-374970982
  for more information.
- The output-filename option for mpirun is now converted to an
  absolute path before being passed to other nodes.
- Add monitoring component for PML, OSC, and COLL to track data
  movement of MPI applications.  See
  ompi/mca/common/monitoring/HowTo_pml_monitoring.tex for more
  information about the monitoring framework.
- Add support for communicator assertions: mpi_assert_no_any_tag,
  mpi_assert_no_any_source, mpi_assert_exact_length, and
  mpi_assert_allow_overtaking.
- Update PMIx to version 2.1.1.
- Update hwloc to 1.11.7.
- Many one-sided behavior fixes.
- Improved performance for Reduce and Allreduce using Rabenseifner's algorithm.
- Revamped mpirun --help output to make it a bit more manageable.
- Portals4 MTL improvements: Fix race condition in rendezvous protocol and
  retry logic.
- UCX OSC: initial implementation.
- UCX PML improvements: add multi-threading support.
- Yalla PML improvements: Fix error with irregular contiguous datatypes.
- Openib BTL: disable XRC support by default.
- TCP BTL: Add check to detect and ignore connections from processes
  that aren't MPI (such as IDS probes) and verify that source and
  destination are using the same version of Open MPI, fix issue with very
  large message transfer.
- ompi_info parsable output now escapes double quotes in values, and
  also quotes values that contain colons.  Thanks to Lev Givon for the
  suggestion.
- CUDA-aware support can now handle GPUs within a node that do not
  support CUDA IPC.  Earlier versions would get an error and abort.
- Add a mca parameter ras_base_launch_orted_on_hn to allow for launching
  MPI processes on the same node where mpirun is executing using a separate
  orte daemon, rather than the mpirun process.   This may be useful to set to
  true when using SLURM, as it improves interoperability with SLURM's signal
  propagation tools.  By default it is set to false, except for Cray XC systems.
- Remove LoadLeveler RAS support.
- Remove IB XRC support from the OpenIB BTL due to lack of support.
- Add functionality for IBM s390 platforms.  Note that regular
  regression testing does not occur on the s390 and it is not
  considered a supported platform.
- Remove support for big endian PowerPC.
- Remove support for XL compilers older than v13.1.
- Remove support for atomic operations using MacOS atomics library.

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Q: Binding to cores on AWS?

2018-01-02 Thread Barrett, Brian via users
Jumping in a little late…  Today, EC2 instances don’t expose all the required 
information for memory pinning to work, which is why you see the warning.  The 
action-less error message is obviously a bit annoying (although it makes sense 
in the general case), but we haven’t had the time to work out the right balance 
between warning the user that Open MPI can’t bind memory (which is what many 
users expect when they bind to core) and annoying the user.

I’m sure that having nearly twice as much L3 cache will improve performance on 
HPC applications.  Depending on what your existing cluster looks like, it may 
also have more memory bandwidth compared to C4 instances.  And, of course, 
there’s the fact that C4 is 10 Gbps ethernet, vs. whatever you’re using in your 
existing cluster.  Finally, you didn’t mention which version of Open MPI you’re 
using.  If you’re using a version before 2.1.2, there were some antiquated 
choices for TCP buffer tuning parameters that were causing some fairly severe 
performance problems on EC2 instances in the multi-node parallel case.

Brian

On Dec 22, 2017, at 7:14 PM, Brian Dobbins wrote:


Hi Gilles,

  You're right, we no longer get warnings... and the performance disparity 
still exists, though to be clear it's only in select parts of the code - others 
run as we'd expect.  This is probably why I initially guessed it was a 
process/memory affinity issue - the one timer I looked at is in a 
memory-intensive part of the code.  Now I'm wondering if we're still getting 
issues binding (I need to do a comparison with a local system), or if it could 
be due to the cache size differences - the AWS C4 instances have 25MB/socket, 
and we have 45MB/socket.  If we fit in cache on our system, and don't on 
theirs, that could account for things.  Testing that is next up on my list, 
actually.

  Cheers,
  - Brian


On Fri, Dec 22, 2017 at 7:55 PM, Gilles Gouaillardet wrote:
Brian,

i have no doubt this was enough to get rid of the warning messages.

out of curiosity, are you now able to experience performances close to
native runs ?
if i understand correctly, the linux kernel allocates memory on the
closest NUMA domain (e.g. socket if i oversimplify), and since
MPI tasks are bound by orted/mpirun before they are execv'ed, i have
some hard time understanding how not binding MPI tasks to
memory can have a significant impact on performances as long as they
are bound on cores.

Cheers,

Gilles


On Sat, Dec 23, 2017 at 7:27 AM, Brian Dobbins wrote:
>
> Hi Ralph,
>
>   Well, this gets chalked up to user error - the default AMI images come
> without the NUMA-dev libraries, so OpenMPI didn't get built with it (and in
> my haste, I hadn't checked).  Oops.  Things seem to be working correctly
> now.
>
>   Thanks again for your help,
>   - Brian
>
>
> On Fri, Dec 22, 2017 at 2:14 PM, r...@open-mpi.org wrote:
>>
>> I honestly don’t know - will have to defer to Brian, who is likely out for
>> at least the extended weekend. I’ll point this one to him when he returns.
>>
>>
>> On Dec 22, 2017, at 1:08 PM, Brian Dobbins wrote:
>>
>>
>>   Hi Ralph,
>>
>>   OK, that certainly makes sense - so the next question is, what prevents
>> binding memory to be local to particular cores?  Is this possible in a
>> virtualized environment like AWS HVM instances?
>>
>>   And does this apply only to dynamic allocations within an instance, or
>> static as well?  I'm pretty unfamiliar with how the hypervisor (KVM-based, I
>> believe) maps out 'real' hardware, including memory, to particular
>> instances.  We've seen some parts of the code (bandwidth heavy) run ~10x
>> faster on bare-metal hardware, though, presumably from memory locality, so
>> it certainly has a big impact.
>>
>>   Thanks again, and merry Christmas!
>>   - Brian
>>
>>
>> On Fri, Dec 22, 2017 at 1:53 PM, r...@open-mpi.org wrote:
>>>
>>> Actually, that message is telling you that binding to core is available,
>>> but that we cannot bind memory to be local to that core. You can verify the
>>> binding pattern by adding --report-bindings to your cmd line.
>>>
>>>
>>> On Dec 22, 2017, at 11:58 AM, Brian Dobbins wrote:
>>>
>>>
>>> Hi all,
>>>
>>>   We're testing a model on AWS using C4/C5 nodes and some of our timers,
>>> in a part of the code with no communication, show really poor performance
>>> compared to native runs.  We think this is because we're not binding to a
>>> core properly and thus not caching, and a quick 'mpirun --bind-to core
>>> hostname' does suggest issues with this on AWS:
>>>
>>> [bdobbins@head run]$ mpirun 

Re: [OMPI users] How can I measure synchronization time of MPI_Bcast()

2017-10-23 Thread Barrett, Brian via users
Gilles suggested your best next course of action; time the MPI_Bcast and 
MPI_Barrier calls and see if there’s a non-linear scaling effect as you 
increase group size.
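
A minimal sketch of the wrapper Gilles describes below -- interposing on 
MPI_Bcast via the PMPI profiling interface so that barrier (synchronization) 
time and broadcast (transmission) time are accumulated separately.  This is an 
illustration of the approach, not a drop-in tool:

    #include <mpi.h>

    double bcast_sync_time = 0.0;   /* time spent waiting for the group   */
    double bcast_xfer_time = 0.0;   /* time spent in the broadcast itself */

    int MPI_Bcast(void *buf, int count, MPI_Datatype type, int root,
                  MPI_Comm comm)
    {
        double t0 = MPI_Wtime();
        PMPI_Barrier(comm);               /* approximate synchronization cost */
        double t1 = MPI_Wtime();
        int rc = PMPI_Bcast(buf, count, type, root, comm);
        double t2 = MPI_Wtime();

        bcast_sync_time += t1 - t0;
        bcast_xfer_time += t2 - t1;
        return rc;
    }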

You mention that you’re using m3.large instances; while this isn’t the list for 
in-depth discussion about EC2 instances (the AWS Forums are better for that), 
I’ll note that unless you’re tied to m3 for organizational or reserved instance 
reasons, you’ll probably be happier on another instance type.  m3 was one of 
the last instance families released which does not support Enhanced Networking. 
 There’s significantly more jitter and latency in the m3 network stack compared 
to platforms which support Enhanced Networking (including the m4 platform).  If 
networking costs are causing your scaling problems, the first step will be 
migrating instance types.

Brian

> On Oct 23, 2017, at 4:19 AM, Gilles Gouaillardet 
>  wrote:
> 
> Konstantinos,
> 
> A simple way is to rewrite MPI_Bcast() and insert timer and
> PMPI_Barrier() before invoking the real PMPI_Bcast().
> time spent in PMPI_Barrier() can be seen as time NOT spent on actual
> data transmission,
> and since all tasks are synchronized upon exit, time spent in
> PMPI_Bcast() can be seen as time spent on actual data transmission.
> this is not perfect, but this is a pretty good approximation.
> You can add extra timers so you end up with an idea of how much time
> is spent in PMPI_Barrier() vs PMPI_Bcast().
> 
> Cheers,
> 
> Gilles
> 
> On Mon, Oct 23, 2017 at 4:16 PM, Konstantinos Konstantinidis
>  wrote:
>> In any case, do you think that the time NOT spent on actual data
>> transmission can impact the total time of the broadcast especially when
>> there are so many groups that communicate (please refer to the numbers I
>> gave before if you want to get an idea).
>> 
>> Also, is there any way to quantify this impact i.e. to measure the time not
>> spent on actual data transmissions?
>> 
>> Kostas
>> 
>> On Fri, Oct 20, 2017 at 10:32 PM, Jeff Hammond 
>> wrote:
>>> 
>>> Broadcast is collective but not necessarily synchronous in the sense you
>>> imply. If you broadcast message size under the eager limit, the root may
>>> return before any non-root processes enter the function. Data transfer may
>>> happen prior to processes entering the function. Only rendezvous forces
>>> synchronization between any two processes but there may still be asynchrony
>>> between different levels of the broadcast tree.
>>> 
>>> Jeff
>>> 
>>> On Fri, Oct 20, 2017 at 3:27 PM Konstantinos Konstantinidis
>>>  wrote:
 
 Hi,
 
 I am running some tests on Amazon EC2 and they require a lot of
 communication among m3.large instances.
 
 I would like to give you an idea of what kind of communication takes
 place. There are 40 m3.large instances. Now, 28672 groups of 5 instances 
 are
 formed in a specific manner (let's skip the details). Within each group,
 each instance broadcasts some unsigned char data to the other 4 instances 
 in
 the group. So within each group, exactly 5 broadcasts take place.
 
 The problem is that if I increase the size of the group from 5 to 10
 there is significant delay in terms of transmission rate while, based on
 some theoretical results, this is not reasonable.
 
 I want to check if one of the reasons that this is happening is due to
 the time needed for the instances to synchronize when they call MPI_Bcast()
 since it's a collective function. As far as I know, all of the machines in
 the broadcast need to call it and then synchronize until the actual data
 transfer starts. Is there any way to measure this synchronization time?
 
 The code is in C++ and the MPI installed is described in the attached
 file.
 ___
 users mailing list
 users@lists.open-mpi.org
 https://lists.open-mpi.org/mailman/listinfo/users
>>> 
>>> --
>>> Jeff Hammond
>>> jeff.scie...@gmail.com
>>> http://jeffhammond.github.io/
>> 
>> 
>> 
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] Open MPI v3.0.0 released

2017-09-12 Thread Barrett, Brian via users
The Open MPI Team, representing a consortium of research, academic, and 
industry partners, is pleased to announce the release of Open MPI version 3.0.0.

v3.0.0 is the start of a new release series for Open MPI.  Open MPI 3.0.0 
enables MPI_THREAD_MULTIPLE by default, so a build option to Open MPI is no 
longer required to enable thread support.  Additionally, the embedded PMIx 
runtime has been updated to 2.1.0 and the embedded hwloc has been updated to 
1.11.7.  There have been numerous other bug fix and performance improvements.  
Version 3.0.0 can be downloaded from the main Open MPI web site:

  https://www.open-mpi.org/software/ompi/v3.0/

NEWS:

Major new features:

- Use UCX allocator for OSHMEM symmetric heap allocations to optimize intra-node
  data transfers.  UCX SPML only.
- Use UCX multi-threaded API in the UCX PML.  Requires UCX 1.0 or later.
- Added support for Flux PMI
- Update embedded PMIx to version 2.1.0
- Update embedded hwloc to version 1.11.7

Changes in behavior compared to prior versions:

- Per Open MPI's versioning scheme (see the README), increasing the
  major version number to 3 indicates that this version is not
  ABI-compatible with prior versions of Open MPI. In addition, there may
  be differences in MCA parameter names and defaults from previous releases.
  Command line options for mpirun and other commands may also differ from
  previous versions. You will need to recompile MPI and OpenSHMEM applications
  to work with this version of Open MPI.
- With this release, Open MPI supports MPI_THREAD_MULTIPLE by default.
- New configure options have been added to specify the locations of libnl
  and zlib.
- A new configure option has been added to request Flux PMI support.
- The help menu for mpirun and related commands is now context based.
  "mpirun --help compatibility" generates the help menu in the same format
  as previous releases.

Removed legacy support:
- AIX is no longer supported.
- LoadLeveler is no longer supported.
- OpenSHMEM currently supports the UCX and MXM transports via the ucx and ikrit
  SPMLs respectively.
- Remove IB XRC support from the OpenIB BTL due to lack of support.
- Remove support for big endian PowerPC.
- Remove support for XL compilers older than v13.1

Known issues:

- MPI_Connect/accept between applications started by different mpirun
  commands will fail, even if ompi-server is running.
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] --enable-builtin-atomics

2017-08-01 Thread Barrett, Brian via users
Well, if you’re trying to get Open MPI running on a platform for which we don’t 
have atomics support, built-in atomics solves a problem for you…
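
For context, a rough illustration (not Open MPI code) of the compiler-provided 
atomics being compared in this thread -- the legacy __sync builtins, the newer 
GCC __atomic builtins, and C11 <stdatomic.h>:

    #include <stdatomic.h>

    static _Atomic long c11_counter;
    static long builtin_counter;

    void bump(void)
    {
        /* C11 atomics (what requiring C11 from the compiler would use) */
        atomic_fetch_add_explicit(&c11_counter, 1, memory_order_relaxed);

        /* GCC __atomic builtin (the sort of thing --enable-builtin-atomics
           selects) */
        __atomic_fetch_add(&builtin_counter, 1, __ATOMIC_RELAXED);

        /* older __sync builtin (implies a full memory barrier) */
        __sync_fetch_and_add(&builtin_counter, 1);
    }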

Brian

> On Aug 1, 2017, at 9:42 AM, Nathan Hjelm  wrote:
> 
> So far only cons.  The gcc and sync builtin atomics provide slower performance 
> on x86-64 (and possibly other platforms).  I plan to investigate this as part 
> of the investigation into requiring C11 atomics from the C compiler.
> 
> -Nathan
> 
> 
>> On Aug 1, 2017, at 10:34 AM, Dave Love  wrote:
>> 
>> What are the pros and cons of configuring with --enable-builtin-atomics?
>> I haven't spotted any discussion of the option.
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Network performance over TCP

2017-07-12 Thread Barrett, Brian via users
Adam -

The btl_tcp_links flag does not currently work (for various reasons) in the 2.x 
and 3.x series.  It’s on my todo list to fix, but I’m not sure it will get done 
before the 3.0.0 release.  Part of the reason that it hasn’t been a priority is 
that most applications (outside of benchmarks) don’t benefit from the 20 Gbps 
between rank pairs, as they are generally talking to multiple peers at once 
(and therefore can drive the full 20 Gbps).  It’s definitely on our roadmap, 
but can’t promise a release just yet.

Brian

On Jul 12, 2017, at 11:44 AM, Adam Sylvester wrote:

I switched over to X1 instances in AWS which have 20 Gbps connectivity.  Using 
iperf3, I'm seeing 11.1 Gbps between them with just one port.  iperf3 supports 
a -P option which will connect using multiple ports...  Setting this to use in 
the range of 5-20 ports (there's some variability from run to run), I can get 
in the range of 18 Gbps aggregate which for a real world speed seems pretty 
good.

Using mpirun with the previously-suggested btl_tcp_sndbuf and btl_tcp_rcvbuf 
settings, I'm getting around 10.7 Gbps.  So, pretty close to iperf with just 
one port (makes sense there'd be some overhead with MPI).  My understanding of 
the btl_tcp_links flag that Gilles mentioned is that it should be analogous to 
iperf's -P flag - it should connect with multiple ports in the hopes of 
improving the aggregate bandwidth.

If that's what this flag is supposed to do, it does not appear to be working 
properly for me.  With lsof, I can see the expected number of ports show up 
when I run iperf.  However, with MPI I only ever see three connections between 
the two machines - sshd, orted, and my actual application.  No matter what I 
set btl_tcp_links to, I don't see any additional ports show up (or any change 
in performance).

Am I misunderstanding what this flag does or is there a bug here?  If I am 
misunderstanding the flag's intent, is there a different flag that would allow 
Open MPI to use multiple ports similar to what iperf is doing?

Thanks.
-Adam

On Mon, Jul 10, 2017 at 9:31 PM, Adam Sylvester wrote:
Thanks again Gilles.  Ahh, better yet - I wasn't familiar with the config file 
way to set these parameters... it'll be easy to bake this into my AMI so that I 
don't have to set them each time while waiting for the next Open MPI release.

Out of mostly laziness I try to keep to the formal releases rather than 
applying patches myself, but thanks for the link to it (the commit comments 
were useful to understand why this improved performance).

-Adam

On Mon, Jul 10, 2017 at 12:04 AM, Gilles Gouaillardet wrote:
Adam,


Thanks for letting us know your performance issue has been resolved.


yes, https://www.open-mpi.org/faq/?category=tcp is the best place to look for 
this kind of information.

i will add a reference to these parameters. i will also ask folks at AWS if 
they have additional/other recommendations.


note you have a few options before 2.1.2 (or 3.0.0) is released :


- update your system wide config file (/.../etc/openmpi-mca-params.conf) or 
user config file

  ($HOME/.openmpi/mca-params.conf) and add the following lines

btl_tcp_sndbuf = 0

btl_tcp_rcvbuf = 0


- add the following environment variable to your environment

export OMPI_MCA_btl_tcp_sndbuf=0

export OMPI_MCA_btl_tcp_rcvbuf=0


- use Open MPI 2.0.3


- last but not least, you can manually download and apply the patch available at

https://github.com/open-mpi/ompi/commit/b64fedf4f652cadc9bfc7c4693f9c1ef01dfb69f.patch


Cheers,

Gilles

On 7/9/2017 11:04 PM, Adam Sylvester wrote:
Gilles,

Thanks for the fast response!

The --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 flags you recommended made a 
huge difference - this got me up to 5.7 Gb/s! I wasn't aware of these flags... 
with a little Googling, is https://www.open-mpi.org/faq/?category=tcp the best 
place to look for this kind of information and any other tweaks I may want to 
try (or if there's a better FAQ out there, please let me know)?
There is only eth0 on my machines so nothing to tweak there (though good to 
know for the future). I also didn't see any improvement by specifying more 
sockets per instance. But, your initial suggestion had a major impact.
In general I try to stay relatively up to date with my Open MPI version; I'll 
be extra motivated to upgrade to 2.1.2 so that I don't have to remember to set 
these --mca flags on the command line. :o)
-Adam

On Sun, Jul 9, 2017 at 9:26 AM, Gilles Gouaillardet wrote:

Adam,

at first, you need to change the default send and receive socket
buffers :
mpirun --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 ...
/* note this will be the default from Open 

Re: [OMPI users] [OMPI USERS] Jumbo frames

2017-05-05 Thread Barrett, Brian via users
But in many ways, it’s also not helpful to change the MTU from Open MPI.  It 
sounds like you made a bunch of changes all at once; I’d break them down and 
build up.  MTU is a very system-level configuration.  Use a tcp transmission 
test (iperf, etc.) to make sure TCP connections work between the nodes.  Once 
that’s working, you can start with Open MPI.  While Open MPI doesn’t have a way 
to set MTU, it can adjust how big the messages it passes to write() are, which 
will result in the same thing if the system is well configured.  In particular, 
you can start with moving around the eager frag limit, although that can have 
impact on memory consumption.

But, as I said, first thing is to get your operating system and networking gear 
set up properly.  It sounds like you’re not quite there yet, but it’s doubtful 
that this list will be the place to get help.

Brian

On May 5, 2017, at 7:41 AM, George Bosilca wrote:

"ompi_info --param btl tcp -l 9" will give you all the TCP options. 
Unfortunately, OMPI does not support programmatically changing the value of the 
MTU.

  George.

PS: We would be happy to receive contributions from the community.


On Fri, May 5, 2017 at 10:29 AM, Alberto Ortiz wrote:
I am using version 1.10.6 on archlinux.
The option I should pass to mpirun should then be "-mca btl_tcp_mtu 13000"? 
Just to be sure.
Thank you,
Alberto

On May 5, 2017, 16:26, "r...@open-mpi.org" wrote:
If you are looking to use TCP packets, then you want to set the send/recv 
buffer size in the TCP btl, not the openib one, yes?

Also, what version of OMPI are you using?

> On May 5, 2017, at 7:16 AM, Alberto Ortiz wrote:
>
> Hi,
> I have a program running with openMPI over a network using a gigabit switch. 
> This switch supports jumbo frames up to 13.000 bytes, so, in order to test 
> and see if it would be faster communicating with these frame lengths, I am 
> trying to use them with my program. I have set the MTU in each node to be 
> 13.000 but when running the program it doesn't even initiate, it gets 
> blocked. I have tried different lengths from 1.500 up to 13.000 but it 
> doesn't work with any length.
>
> I have searched and only found that I have to set OMPI with "-mca 
> btl_openib_ib_mtu 13000" or the length to be used, but I don't seem to get it 
> working.
>
> Which are the steps to get OMPI to use a larger TCP packet length? Is it 
> possible to reach 13000 bytes instead of the standard 1500?
>
> Thank you in advance,
> Alberto
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users