Re: [OMPI users] Seg error when using v5.0.1

2024-01-30 Thread Joseph Schuchart via users

Hello,

This looks like memory corruption. Do you have more details on what your 
app is doing? I don't see any MPI calls inside the call stack. Could you 
rebuild Open MPI with debug information enabled (by adding 
`--enable-debug` to configure)? If this error occurs on singleton runs 
(1 process) then you can easily attach gdb to it to get a better stack 
trace. Also, valgrind may help pin down the problem by telling you which 
memory block is being free'd here.


Thanks
Joseph

On 1/30/24 07:41, afernandez via users wrote:

Hello,
I upgraded one of the systems to v5.0.1 and have compiled everything 
exactly as dozens of previous times with v4. I wasn't expecting any 
issue (and the compilations didn't report anything out of ordinary) 
but running several apps has resulted in error messages such as:

Backtrace for this error:
#0  0x7f7c9571f51f in ???
        at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1  0x7f7c957823fe in __GI___libc_free
        at ./malloc/malloc.c:3368
#2  0x7f7c93a635c3 in ???
#3  0x7f7c95f84048 in ???
#4  0x7f7c95f1cef1 in ???
#5  0x7f7c95e34b7b in ???
#6  0x6e05be in ???
#7  0x6e58d7 in ???
#8  0x405d2c in ???
#9  0x7f7c95706d8f in __libc_start_call_main
        at ../sysdeps/nptl/libc_start_call_main.h:58
#10  0x7f7c95706e3f in __libc_start_main_impl
        at ../csu/libc-start.c:392
#11  0x405d64 in ???
#12  0x in ???
OS is Ubuntu 22.04, OpenMPI was built with GCC 13.2, and before 
building OpenMPI, I had previously built the hwloc (2.10.0) library at 
/usr/lib/x86_64-linux-gnu. Maybe I'm missing something pretty basic, 
but the problem seems to be related to memory allocation.

Thanks.





Re: [OMPI users] A make error when build openmpi-5.0.0 using the gcc 14.0.0 (experimental) compiler

2023-12-19 Thread Joseph Schuchart via users
Thanks for the report Jorge! I opened a ticket to track the build issues 
with GCC-14: https://github.com/open-mpi/ompi/issues/12169


Hopefully we will have Open MPI building with GCC-14 before it is released.

Cheers,
Joseph

On 12/17/23 06:03, Jorge D'Elia via users wrote:

Hi there,

I already overcame this problem: simply by using the gcc version (GCC) 
13.2.1 that comes with the Fedora 39 distribution, the openmpi build is 
now fine again,

as it (almost) always is.

Greetings.
Jorge.

--
Jorge D'Elia via users  escribió:


Hi,

On a x86_64-pc-linux-gnu machine with Fedora 39:

$ uname -a
Linux amaral 6.6.6-200.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Dec 11 
17:29:08 UTC 2023 x86_64 GNU/Linux


and using:

$ gcc --version
gcc (GCC) 14.0.0 20231216 (experimental)

we tried to upgrade to the openmpi distribution:

41968409 Dec 16 09:15 openmpi-5.0.0.tar.gz

using the configuration flags (already used in previous versions of 
openmpi):


$ ../configure --enable-ipv6 --enable-sparse-groups --enable-mpi-ext 
--enable-oshmem --with-libevent=internal --with-hwloc=internal 
--with-ucx --with-pmix=internal --without-libfabric 
--prefix=${PREFIX} 2>&1 | tee configure.eco


$ make -j4 all 2>&1 | tee make-all.eco

but, today, we have the following make error:

...
/home/bigpack/openmpi-paq/openmpi-5.0.0/3rd-party/openpmix/include/pmix_deprecated.h:851:32: 
error: passing argument 2 of ‘PMIx_Data_buffer_unload’ from 
incompatible pointer type [-Wincompatible-pointer-types]

PMIx_Data_buffer_unload(b, &(d), &(s)) void **

We attached the configure.echo and make-all.echo files in a *.tgz 
compressed file.


Please, any clue on how to fix this? Thanks in advance.

Regards.
Jorge D'Elia.--






Re: [OMPI users] MPI_Get is slow with structs containing padding

2023-03-30 Thread Joseph Schuchart via users

Hi Antoine,

That's an interesting result. I believe the problem with datatypes with 
gaps is that MPI is not allowed to touch the gaps. My guess is that for 
the RMA version of the benchmark the implementation either has to revert 
back to an active message packing the data at the target and sending it 
back or (which seems more likely in your case) transfer each object 
separately and skip the gaps. Without more information on your setup 
(using UCX?) and the benchmark itself (how many elements? what does the 
target do?) it's hard to be more precise.


A possible fix would be to drop the MPI datatype for the RMA use and 
transfer the vector as a whole, using MPI_BYTE. I think there is also a 
way to modify the upper bound of the MPI type to remove the gap, using 
MPI_TYPE_CREATE_RESIZED. I expect that that will allow MPI to touch the 
gap and transfer the vector as a whole. I'm not sure about the details 
there, maybe someone can shed some light.
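
For reference, a rough sketch of the resized-type idea, untested and only 
for illustration (elem_t mirrors the {double, int} struct from the original 
post; whether the implementation then treats the buffer as one contiguous 
block is implementation-specific):

```
#include <mpi.h>
#include <stddef.h>

typedef struct { double d; int i; } elem_t;   /* 12 bytes of data + 4 bytes of padding */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* describe the two members... */
    int          blocklens[2] = { 1, 1 };
    MPI_Aint     displs[2]    = { offsetof(elem_t, d), offsetof(elem_t, i) };
    MPI_Datatype types[2]     = { MPI_DOUBLE, MPI_INT };
    MPI_Datatype tmp, padded_type;

    MPI_Type_create_struct(2, blocklens, displs, types, &tmp);
    /* ...and force lower bound 0 and extent sizeof(elem_t), so that
       consecutive vector elements line up, trailing gap included */
    MPI_Type_create_resized(tmp, 0, sizeof(elem_t), &padded_type);
    MPI_Type_commit(&padded_type);
    MPI_Type_free(&tmp);

    /* padded_type can now be used in MPI_Put/MPI_Get/MPI_Sendrecv calls */

    MPI_Type_free(&padded_type);
    MPI_Finalize();
    return 0;
}
```

The MPI_BYTE route is simpler still: describe the whole vector as 
count * sizeof(elem_t) bytes, which certainly lets the implementation move 
it in one piece, at the cost of type checking and heterogeneous-system 
support.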


HTH
Joseph

On 3/30/23 18:34, Antoine Motte via users wrote:


Hello everyone,

I recently had to code an MPI application where I send std::vector 
contents in a distributed environment. In order to try different 
approaches I coded both 1-sided and 2-sided point-to-point 
communication schemes, the first one uses MPI_Window and MPI_Get, the 
second one uses MPI_SendRecv.


I had a hard time figuring out why my implementation with MPI_Get was 
between 10 and 100 times slower, and I finally found out that MPI_Get 
is abnormally slow when one tries to send custom datatypes including 
padding.


Here is a short example attached, where I send a struct {double, int} 
(12 bytes of data + 4 bytes of padding) vs a struct {double, int, int} 
(16 bytes of data, 0 bytes of padding) with both MPI_SendRecv and 
MPI_Get. I got these results:


mpirun -np 4 ./compareGetWithSendRecv
{double, int} SendRecv : 0.0303547 s
{double, int} Get : 1.9196 s
{double, int, int} SendRecv : 0.0164659 s
{double, int, int} Get : 0.0147757 s

I ran it with both Open MPI 4.1.2 and Intel MPI 2021.6 and got 
the same results.


Is this result normal? Do I have any solution other than adding 
garbage at the end of the struct or at the end of the MPI_Datatype to 
avoid padding?


Regards,

Antoine Motte





Re: [OMPI users] Tracing of openmpi internal functions

2022-11-16 Thread Joseph Schuchart via users

Arun,

You can use a small wrapper script like this one to store the perf data 
in separate files:


```
$ cat perfwrap.sh
#!/bin/bash
exec perf record -o perf.data.$OMPI_COMM_WORLD_RANK "$@"
```

Then do `mpirun -n <N> ./perfwrap.sh ./a.out` to run all processes under 
perf. You can also select a subset of processes based on 
$OMPI_COMM_WORLD_RANK.


HTH,
Joseph


On 11/16/22 09:24, Chandran, Arun via users wrote:

Hi Jeff,

Thanks, I will check flamegraphs.

Sample generation with perf could be a problem. I don't think I can do 'mpirun -np <N> 
perf record ...' and get
the sampling done on all the cores and store each core's data (perf.data) 
separately to analyze it. Is it possible to do?

I came to know that AMD uProf supports individual sample collection for MPI apps 
running on multiple cores; I need to investigate this further.

--Arun

From: users  On Behalf Of Jeff Squyres 
(jsquyres) via users
Sent: Monday, November 14, 2022 11:34 PM
To: users@lists.open-mpi.org
Cc: Jeff Squyres (jsquyres) ; arun c 

Subject: Re: [OMPI users] Tracing of openmpi internal functions



Open MPI uses plug-in modules for its implementations of the MPI collective 
algorithms.  From that perspective, once you understand that infrastructure, 
it's exactly the same regardless of whether the MPI job is using intra-node or 
inter-node collectives.

We don't have much in the way of detailed internal function call tracing inside 
Open MPI itself, due to performance considerations.  You might want to look 
into flamegraphs, or something similar...?

--
Jeff Squyres
mailto:jsquy...@cisco.com

From: users  on behalf of arun c via users 

Sent: Saturday, November 12, 2022 9:46 AM
To: mailto:users@lists.open-mpi.org 
Cc: arun c 
Subject: [OMPI users] Tracing of openmpi internal functions
  
Hi All,


I am new to openmpi and trying to learn the internals (source code
level) of data transfer during collective operations. At first, I will
limit it to intra-node (between cpu cores, and sockets) to minimize
the scope of learning.

What are the best options (Looking for only free and open methods) for
tracing the openmpi code? (say I want to execute alltoall collective
and trace all the function calls and event callbacks that happened
inside the libmpi.so on all the cores)

Linux kernel has something called ftrace, it gives a neat call graph
of all the internal functions inside the kernel with time, is
something similar available?

--Arun




Re: [OMPI users] MPI_THREAD_MULTIPLE question

2022-09-10 Thread Joseph Schuchart via users

Timesir,

It sounds like you're using the 4.0.x or 4.1.x release. The one-sided 
components were cleaned up in the upcoming 5.0.x release and the 
component in question (osc/pt2pt) was removed. You could also try to 
compile Open MPI 4.0.x/4.1.x against UCX and use osc/ucx (by passing 
`--mca osc ucx` to mpirun). In either case (using UCX or switching to 
5.0.x) you should be able to run MPI RMA codes without requiring an 
RDMA-capable network.


HTH
Joseph

On 9/10/22 06:55, mrlong336 via users wrote:

mpirun reports the following error:

The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this 
release.

Workarounds are to run on a single node, or to use a system with an RDMA
capable network such as Infiniband.

Does this error mean that the network must support RDMA if it wants 
to run distributed? Will Gigabit/10 Gigabit Ethernet work?



Best regards,

Timesir
mrlong...@gmail.com






[OMPI users] 1st Future of MPI RMA Workshop: Call for Short Talks and Participation

2022-05-29 Thread Joseph Schuchart via users

[Apologies if you got multiple copies of this email.]

*1st Future of MPI RMA Workshop (FoRMA'22)*

https://mpiwg-rma.github.io/forma22/

The MPI RMA Working Group is organizing a workshop aimed at gathering 
inputs from users and implementors of MPI RMA with past experiences and 
ideas for improvements, as well as hardware vendors to discuss future 
developments of high-performance network hardware relevant to MPI RMA. 
The goal is to evaluate the current design of MPI RMA, to learn from the 
current state of practice, and to start the process of rethinking the 
design of one-sided communication in the MPI standard.


The workshop will be held entirely virtually on *June 16 & 17*, and the 
results and contributions will be combined into a joint white paper.


*Call for Short Talks*

We are seeking input from users of MPI RMA willing to share their 
experiences, results, and lessons learned in short talks of 10-15 
minutes. No stand-alone publications are required. In particular, the 
following topics for contributions are of interest:


- Successful and unsuccessful attempts of using MPI RMA to improve the 
communication efficiency of applications and middle-ware;
- Challenges in implementing and porting applications and middle-ware on 
top of MPI RMA;

- Features that are missing from MPI RMA.

If interested, please send a short email with the title of the talk 
to schuch...@icl.utk.edu.


*Call for Participation*

We encourage everyone interested in one-sided communication models in 
general and MPI RMA in particular to join this two day workshop and 
contribute to its success through questions and comments. We encourage a 
lively and open exchange of ideas and discussions between the speakers 
and the audience. The connection information will be posted before the 
workshop at https://mpiwg-rma.github.io/forma22/. Please direct any 
questions to schuch...@icl.utk.edu.



*Registration*

To register for the workshop (and receive the access link) please use 
the registration site at 
https://tennessee.zoom.us/meeting/register/tJ0qduChrDgsGNdEG3MQeLB-DH3lZ6r-DZww. 
All participation is free.



*Organizing committee*

Joseph Schuchart, University of Tennessee, Knoxville
James Dinan, Nvidia Inc.
Bill Gropp, University of Illinois Urbana-Champaign



Re: [OMPI users] Check equality of a value in all MPI ranks

2022-02-17 Thread Joseph Schuchart via users

Hi Niranda,

A pattern I have seen in several places is to allreduce the pair p = 
{-x,x} with MPI_MIN or MPI_MAX. If in the resulting pair p[0] == -p[1], 
then everyone has the same value. If not, at least one rank had a 
different value. Example:


```
bool is_same(int x) {
  int p[2];
  p[0] = -x;
  p[1] = x;
  MPI_Allreduce(MPI_IN_PLACE, p, 2, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
  return (p[0] == -p[1]);
}
```

HTH,
Joseph

On 2/17/22 16:40, Niranda Perera via users wrote:

Hi all,

Say I have some int `x`. I want to check if all MPI ranks get the same 
value for `x`. What's a good way to achieve this using MPI collectives?


The simplest I could think of is, broadcast rank0's `x`, do the 
comparison, and allreduce-LAND the comparison result. This requires 
two collective operations.

```python
...
x = ... # each rank may produce different values for x
x_bcast = comm.bcast(x, root=0)
all_equal = comm.allreduce(x==x_bcast, op=MPI.LAND)
if not all_equal:
   raise Exception()
...
```
Is there a better way to do this?


--
Niranda Perera
https://niranda.dev/
@n1r44 





Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-11 Thread Joseph Schuchart via users
I am not aware of anything similar in Open MPI. Maybe OSU-INAM can work 
with other MPI implementations? Would be worth investigating...


Joseph

On 2/11/22 06:54, Bertini, Denis Dr. wrote:


Hi Joseph

Looking at the MVAPICH i noticed that, in this MPI implementation

an Infiniband Network Analysis and Profiling Tool is provided:


OSU-INAM


Is there something equivalent using openMPI ?

Best

Denis



*From:* users  on behalf of Joseph 
Schuchart via users 

*Sent:* Tuesday, February 8, 2022 4:02:53 PM
*To:* users@lists.open-mpi.org
*Cc:* Joseph Schuchart
*Subject:* Re: [OMPI users] Using OSU benchmarks for checking 
Infiniband network

Hi Denis,

Sorry if I missed it in your previous messages but could you also try
running a different MPI implementation (MVAPICH) to see whether Open MPI
is at fault or the system is somehow to blame for it?

Thanks
Joseph

On 2/8/22 03:06, Bertini, Denis Dr. via users wrote:
>
> Hi
>
> Thanks for all these informations !
>
>
> But i have to confess that in this multi-tuning-parameter space,
>
> i got somehow lost.
>
> Furthermore it is somtimes mixing between user-space and kernel-space.
>
> I have only possibility to act on the user space.
>
>
> 1) So i have on the system max locked memory:
>
>                         - ulimit -l unlimited (default )
>
>   and i do not see any warnings/errors related to that when 
launching MPI.

>
>
> 2) I tried differents algorithms for MPI_all_reduce op.  all showing
> drop in
>
> bw for size=16384
>
>
> 4) I disable openIB ( no RDMA, ) and used only TCP, and i noticed
>
> the same behaviour.
>
>
> 3) i realized that increasing the so-called warm up parameter  in the
>
> OSU benchmark (argument -x 200 as default) the discrepancy.
>
> At the contrary putting lower threshold ( -x 10 ) can increase this BW
>
> discrepancy up to factor 300 at message size 16384 compare to
>
> message size 8192 for example.
>
> So does it means that there are some caching effects
>
> in the internode communication?
>
>
> From my experience, to tune parameters is a time-consuming and 
cumbersome

>
> task.
>
>
> Could it also be the problem is not really on the openMPI
> implemenation but on the
>
> system?
>
>
> Best
>
> Denis
>
> 
> *From:* users  on behalf of Gus
> Correa via users 
> *Sent:* Monday, February 7, 2022 9:14:19 PM
> *To:* Open MPI Users
> *Cc:* Gus Correa
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
> Infiniband network
> This may have changed since, but these used to be relevant points.
> Overall, the Open MPI FAQ have lots of good suggestions:
> https://www.open-mpi.org/faq/
> some specific for performance tuning:
> https://www.open-mpi.org/faq/?category=tuning
> https://www.open-mpi.org/faq/?category=openfabrics
>
> 1) Make sure you are not using the Ethernet TCP/IP, which is widely
> available in compute nodes:
> mpirun  --mca btl self,sm,openib  ...
>
> https://www.open-mpi.org/faq/?category=tuning#selecting-components
>
> However, this may have changed lately:
> https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable
> 2) Maximum locked memory used by IB and their system limit. Start
> here:
> 
https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage

> 3) The eager vs. rendezvous message size threshold. I wonder if it may
> sit right where you see the latency spike.
> https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user
> 4) Processor and memory locality/affinity and binding (please check
> the current options and syntax)
> https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4
>
> On Mon, Feb 7, 2022 at 11:01 AM Benson Muite via users
>  wrote:
>
> Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php
>
> mpirun --verbose --display-map
>
> Have you tried newer OpenMPI versions?
>
> Do you get similar behavior for the osu_reduce and osu_gather
> benchmarks?
>
> Typically internal buffer sizes as well as your hardware will affect
> performance. Can you give specifications similar to what is
> available at:
> http://mvapich.cse.ohio-state.edu/performance/collectives/
> where the operating system, switch, node type and memory are
> indicated.
>
> If you need good performance, may want to also specify the algorithm
> used. You can find some of the parameters you can tune using:
>
> ompi_info --all
>
> A particular helpful parameter is:
>
> MCA coll tuned: para

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-08 Thread Joseph Schuchart via users

Hi Denis,

Sorry if I missed it in your previous messages but could you also try 
running a different MPI implementation (MVAPICH) to see whether Open MPI 
is at fault or the system is somehow to blame for it?


Thanks
Joseph

On 2/8/22 03:06, Bertini, Denis Dr. via users wrote:


Hi

Thanks for all these informations !


But I have to confess that in this multi-tuning-parameter space,

I got somewhat lost.

Furthermore, it sometimes mixes user-space and kernel-space concerns.

I only have the possibility to act on the user space.


1) So I have on the system max locked memory:

                        - ulimit -l unlimited (default)

  and I do not see any warnings/errors related to that when launching MPI.


2) I tried different algorithms for the MPI_Allreduce op, all showing a 
drop in

bw for size=16384


4) I disabled openib (no RDMA) and used only TCP, and I noticed

the same behaviour.


3) I realized that increasing the so-called warm-up parameter in the

OSU benchmark (argument -x, 200 as default) reduces the discrepancy.

On the contrary, putting a lower threshold (-x 10) can increase this BW

discrepancy up to a factor of 300 at message size 16384 compared to

message size 8192, for example.

So does it mean that there are some caching effects

in the internode communication?


From my experience, tuning parameters is a time-consuming and cumbersome

task.


Could it also be that the problem is not really in the openMPI 
implementation but in the

system?


Best

Denis


*From:* users  on behalf of Gus 
Correa via users 

*Sent:* Monday, February 7, 2022 9:14:19 PM
*To:* Open MPI Users
*Cc:* Gus Correa
*Subject:* Re: [OMPI users] Using OSU benchmarks for checking 
Infiniband network

This may have changed since, but these used to be relevant points.
Overall, the Open MPI FAQ have lots of good suggestions:
https://www.open-mpi.org/faq/
some specific for performance tuning:
https://www.open-mpi.org/faq/?category=tuning
https://www.open-mpi.org/faq/?category=openfabrics

1) Make sure you are not using the Ethernet TCP/IP, which is widely 
available in compute nodes:

mpirun  --mca btl self,sm,openib  ...

https://www.open-mpi.org/faq/?category=tuning#selecting-components

However, this may have changed lately: 
https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable
2) Maximum locked memory used by IB and their system limit. Start 
here: 
https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage
3) The eager vs. rendezvous message size threshold. I wonder if it may 
sit right where you see the latency spike.

https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user
4) Processor and memory locality/affinity and binding (please check 
the current options and syntax)

https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4

On Mon, Feb 7, 2022 at 11:01 AM Benson Muite via users 
 wrote:


Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php

mpirun --verbose --display-map

Have you tried newer OpenMPI versions?

Do you get similar behavior for the osu_reduce and osu_gather
benchmarks?

Typically internal buffer sizes as well as your hardware will affect
performance. Can you give specifications similar to what is
available at:
http://mvapich.cse.ohio-state.edu/performance/collectives/
where the operating system, switch, node type and memory are
indicated.

If you need good performance, may want to also specify the algorithm
used. You can find some of the parameters you can tune using:

ompi_info --all

A particular helpful parameter is:

MCA coll tuned: parameter "coll_tuned_allreduce_algorithm" (current
value: "ignore", data source: default, level: 5 tuner/detail,
type: int)
                           Which allreduce algorithm is used. Can be
locked down to any of: 0 ignore, 1 basic linear, 2 nonoverlapping
(tuned
reduce + tuned bcast), 3 recursive doubling, 4 ring, 5 segmented ring
                           Valid values: 0:"ignore",
1:"basic_linear",
2:"nonoverlapping", 3:"recursive_doubling", 4:"ring",
5:"segmented_ring", 6:"rabenseifner"
           MCA coll tuned: parameter
"coll_tuned_allreduce_algorithm_segmentsize" (current value: "0",
data
source: default, level: 5 tuner/detail, type: int)

For OpenMPI 4.0, there is a tuning program [2] that might also be
helpful.

[1]

https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi
[2] https://github.com/open-mpi/ompi-collectives-tuning

On 2/7/22 4:49 PM, Bertini, Denis Dr. wrote:
> Hi
>
> When i repeat i always got the huge discrepancy at the
>
> message size of 16384.
>
> May be there is a way to run mpi in verbose mode in order
>
> to further investigate this behaviour?
>
> Best
>
> Denis
>
>
   

Re: [OMPI users] OpenMPI and maker - Multiple messages

2021-02-18 Thread Joseph Schuchart via users

Thomas,

The post you are referencing suggests to run

mpiexec -mca btl ^openib -n 40 maker -help

but you are running

mpiexec -mca btl ^openib -N 5 gcc --version

which will run 5 instances of GCC. The output you're seeing is totally 
to be expected.


I don't think anyone here can help you with getting maker to work 
correctly with MPI. I suggest you first check whether maker is actually 
configured to use MPI. The maker-based test (as suggested in the post) 
failing does not necessarily mean that Open MPI isn't working; it might 
also happen if maker is not correctly configured.


One way to check whether Open MPI is working correctly on your system is 
to use a simple MPI program that prints the world communicator size and 
rank. Any MPI hello world program you find online should do.
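
For instance, something as small as this will do (any hello-world found 
online is equivalent):

```
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```

If `mpiexec -n 4 ./a.out` prints four different ranks out of a size of 4 
(rather than four copies of "rank 0 of 1"), Open MPI itself is working and 
the problem is on the maker side.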


Cheers
Joseph

On 2/18/21 1:39 PM, Thomas Eylenbosch via users wrote:

Hello

We are trying to run 
maker (http://www.yandell-lab.org/software/maker.html) in combination 
with OpenMPI


But when we are trying to submit a job with the maker and openmpi,

We see the following error in the log file:

--Next Contig--

#-

Another instance of maker is processing *this* contig!!

SeqID: chrA10

Length: 17398227

#-

According to

http://gmod.827538.n3.nabble.com/Does-maker-support-muti-processing-for-a-single-long-fasta-sequence-using-openMPI-td4061342.html

We have to run the following command mpiexec -mca btl ^openib -n 40 
maker -help


“If you get a single help message then everything is fine.  If you get 
40 help messages, then MPI is not communicating correctly.”


We are using the following command to demonstrate what is going wrong:

mpiexec -mca btl ^openib -N 5 gcc --version

gcc (GCC) 10.2.0

Copyright (C) 2020 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

gcc (GCC) 10.2.0

Copyright (C) 2020 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

gcc (GCC) 10.2.0

Copyright (C) 2020 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

gcc (GCC) 10.2.0

Copyright (C) 2020 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

gcc (GCC) 10.2.0

Copyright (C) 2020 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

So we are getting the message 5 times, which means OpenMPI is not 
correctly installed on our cluster?


We are using EasyBuild to build/install our OpenMPI module. ( with the 
default OpenMPI easyblock module)


Best regards / met vriendelijke groeten*
Thomas Eylenbosch*
DevOps Engineer (OnSite), Gluo N.V.

*Currently available at BASF Belgium Coordination Center CommV*
Email: thomas.eylenbo...@partners.basf.com 

Postal Address: BASF Belgium Coordination Center CommV, Technologiepark 
101, 9052 Gent Zwijnaarde, Belgium


BASF Belgium Coordination Center CommV

Scheldelaan 600, 2040 Antwerpen, België

RPR Antwerpen (afd. Antwerpen)

BTW BE0862.390.376

_www.basf.be _

Deutsche Bank AG

IBAN: BE43 8262 8044 4801

BIC: DEUTBEBE

Information on data protection can be found here: 
https://www.basf.com/global/en/legal/data-protection-at-basf.html




Re: [OMPI users] Issue with MPI_Get_processor_name() in Cygwin

2021-02-09 Thread Joseph Schuchart via users

Martin,

The name argument to MPI_Get_processor_name is a character string of 
length at least MPI_MAX_PROCESSOR_NAME, which in OMPI is 256. You are 
providing a character string of length 200, so OMPI is free to write 
past the end of your string and into some of your stack variables, hence 
you are "losing" the values of rank and size. The issue should be gone 
if you write `char hostName[MPI_MAX_PROCESSOR_NAME];`


Cheers
Joseph

On 2/9/21 9:14 PM, Martín Morales via users wrote:

Hello,

I have what could be a memory corruption with 
MPI_Get_processor_name() in Cygwin.

I'm using OMPI 4.1.0; I tried also on Linux (same OMPI version) but 
there isn't an issue there.

Below is an example of a trivial spawn operation. It has 2 scripts: 
spawned and spawner.

In the spawned script, if I move the MPI_Get_processor_name() line 
below MPI_Comm_size() I lose the values of rank and size.

In fact, I declared some other variables in the `int hostName_len, rank, 
size;` line and I lost them too.

Regards,

Martín

---

Spawned:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char ** argv){
    int hostName_len, rank, size;
    MPI_Comm parentcomm;
    char hostName[200];

    MPI_Init( NULL, NULL );
    MPI_Comm_get_parent( &parentcomm );
    MPI_Get_processor_name(hostName, &hostName_len);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (parentcomm != MPI_COMM_NULL) {
        printf("I'm the spawned h: %s  r/s: %i/%i\n", hostName, rank, size);
    }

    MPI_Finalize();
    return 0;
}

Spawner:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char ** argv){
    int processesToRun;
    MPI_Comm intercomm;

    if(argc < 2 ){
        printf("Processes number needed!\n");
        return 0;
    }
    processesToRun = atoi(argv[1]);

    MPI_Init( NULL, NULL );
    printf("Spawning from parent:...\n");
    MPI_Comm_spawn( "./spawned", MPI_ARGV_NULL, processesToRun,
                    MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Finalize();
    return 0;
}



Re: [OMPI users] mpirun on Kubuntu 20.4.1 hangs

2020-10-22 Thread Joseph Schuchart via users

Hi Jorge,

Can you try to get a stack trace of mpirun using the following command 
in a separate terminal?


sudo gdb -batch -ex "thread apply all bt" -p $(ps -C mpirun -o pid= | 
head -n 1)


Maybe that will give some insight where mpirun is hanging.

Cheers,
Joseph

On 10/21/20 9:58 PM, Jorge SILVA via users wrote:

Hello Jeff,

The program is not executed; it seems to wait for something to connect to 
(why ctrl-C twice?)


jorge@gcp26:~/MPIRUN$ mpirun -np 1 touch /tmp/foo
^C^C

jorge@gcp26:~/MPIRUN$ ls -l /tmp/foo
ls: cannot access '/tmp/foo': No such file or directory

no file  is created..

In fact, my question was whether there are differences in mpirun usage 
between these versions... The


mpirun -help

gives a different output as expected, but I  tried a lot of options 
without any success.



Le 21/10/2020 à 21:16, Jeff Squyres (jsquyres) a écrit :
There's huge differences between Open MPI v2.1.1 and v4.0.3 (i.e., 
years of development effort); it would be very hard to categorize them 
all; sorry!


What happens if you

    mpirun -np 1 touch /tmp/foo

(Yes, you can run non-MPI apps through mpirun)

Is /tmp/foo created?  (i.e., did the job run, and mpirun is somehow 
not terminating)




On Oct 21, 2020, at 12:22 PM, Jorge SILVA via users 
mailto:users@lists.open-mpi.org>> wrote:


Hello Gus,

 Thank you for your answer. Unfortunately my problem is much more 
basic. I didn't try to run the program on both computers, but just 
to run something on one computer. I just installed the new OS and 
openmpi on two different computers, in the standard way, with the 
same result.


For example:

In kubuntu20.4.1 LTS with openmpi 4.0.3-0ubuntu

jorge@gcp26:~/MPIRUN$ cat hello.f90
 print*,"Hello World!"
end
jorge@gcp26:~/MPIRUN$ mpif90 hello.f90 -o hello
jorge@gcp26:~/MPIRUN$ ./hello
 Hello World!
jorge@gcp26:~/MPIRUN$ mpirun -np 1 hello <---here  the program hangs 
with no output

^C^Cjorge@gcp26:~/MPIRUN$

The mpirun task sleeps with no output, and only twice ctrl-C ends the 
execution  :


jorge   5540  0.1  0.0 44768  8472 pts/8    S+   17:54   0:00 
mpirun -np 1 hello


In kubuntu 18.04.5 LTS with openmpi 2.1.1, of course, the same 
program gives


jorge@gcp30:~/MPIRUN$ cat hello.f90
 print*, "Hello World!"
 END
jorge@gcp30:~/MPIRUN$ mpif90 hello.f90 -o hello
jorge@gcp30:~/MPIRUN$ ./hello
 Hello World!
jorge@gcp30:~/MPIRUN$ mpirun -np 1 hello
 Hello World
jorge@gcp30:~/MPIRUN$


Even just typing mpirun hangs without the usual error message.

Are there any changes between the two versions of openmpi that I 
missed? Is some package missing for mpirun?


Thank you again for your help

Jorge


Le 21/10/2020 à 00:20, Gus Correa a écrit :

Hi Jorge

You may have an active firewall protecting either computer or both,
and preventing mpirun to start the connection.
Your /etc/hosts file may also not have the computer IP addresses.
You may also want to try the --hostfile option.
Likewise, the --verbose option may also help diagnose the problem.

It would help if you send the mpirun command line, the hostfile (if 
any),

error message if any, etc.


These FAQs may help diagnose and solve the problem:

https://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
https://www.open-mpi.org/faq/?category=running#mpirun-hostfile
https://www.open-mpi.org/faq/?category=running

I hope this helps,
Gus Correa

On Tue, Oct 20, 2020 at 4:47 PM Jorge SILVA via users 
mailto:users@lists.open-mpi.org>> wrote:


Hello,

I installed kubuntu20.4.1 with openmpi 4.0.3-0ubuntu in two
different
computers in the standard way. Compiling with mpif90 works, but
mpirun
hangs with no output on both systems. Even the mpirun command without
parameters hangs, and only typing ctrl-C twice can end the sleeping
program. Only the command

 mpirun --help

gives the usual output.

It seems to be something related to the terminal output, but the
command
worked well on Kubuntu 18.04. Is there a way to debug or fix this
problem (without re-compiling from sources, etc)? Is it a known
problem?

Thanks,

  Jorge




--
Jeff Squyres
jsquy...@cisco.com 



Re: [OMPI users] Limiting IP addresses used by OpenMPI

2020-09-01 Thread Joseph Schuchart via users

Charles,

What is the machine configuration you're running on? It seems that there 
are two MCA parameters for the tcp btl: btl_tcp_if_include and 
btl_tcp_if_exclude (see ompi_info for details). There may be other knobs 
I'm not aware of. If you're using UCX then my guess is that UCX has its 
own way to choose the network interface to be used...


Cheers
Joseph

On 9/1/20 9:35 PM, Charles Doland via users wrote:
Yes. It is not unusual to have multiple network interfaces on each host 
of a cluster. Usually there is a preference to use only one network 
interface on each host due to higher speed or throughput, or other 
considerations. It would be useful to be able to explicitly specify the 
interface to use for cases in which the MPI code does not select the 
preferred interface.


Charles Doland
charles.dol...@ansys.com 
(408) 627-6621  [x6621]

*From:* users  on behalf of John 
Hearns via users 

*Sent:* Tuesday, September 1, 2020 12:22 PM
*To:* Open MPI Users 
*Cc:* John Hearns 
*Subject:* Re: [OMPI users] Limiting IP addresses used by OpenMPI

*[External Sender]*

Charles, I recall using the I_MPI_NETMASK to choose which interface for 
MPI to use.

I guess you are asking the same question for OpenMPI?

On Tue, 1 Sep 2020 at 17:03, Charles Doland via users 
mailto:users@lists.open-mpi.org>> wrote:


Is there a way to limit the IP addresses or network interfaces used
for communication by OpenMPI? I am looking for something similar to
the I_MPI_TCP_NETMASK or I_MPI_NETMASK environment variables for
Intel MPI.

The OpenMPI documentation mentions the btl_tcp_if_include
and btl_tcp_if_exclude MCA options. These do not  appear to be
present, at least in OpenMPI v3.1.2. Is there another way to do
this? Or are these options supported in a different version?

Charles Doland
charles.dol...@ansys.com 
(408) 627-6621  [x6621]



Re: [OMPI users] Is the mpi.3 manpage out of date?

2020-08-31 Thread Joseph Schuchart via users

Andy,

Thanks for pointing this out. We have merged a fix that corrects that 
stale comment in master :)


Cheers
Joseph

On 8/25/20 8:36 PM, Riebs, Andy via users wrote:
In searching to confirm my belief that recent versions of Open MPI 
support the MPI-3.1 standard, I was a bit surprised to find this in the 
mpi.3 man page from the 4.0.2 release:


“The  outcome,  known  as  the MPI Standard, was first published in 
1993; its most recent version (MPI-2) was published in July 1997. Open 
MPI 1.2 includes all MPI 1.2-compliant and MPI 2-compliant routines.”


(For those who are manpage-averse, see < 
https://www.open-mpi.org/doc/v4.0/man3/MPI.3.php>.)


I’m willing to bet that y’all haven’t been sitting on your hands since 
Open MPI 1.2 was released!


Andy

--

Andy Riebs

andy.ri...@hpe.com

Hewlett Packard Enterprise

High Performance Computing Software Engineering

+1 404 648 9024



Re: [OMPI users] Silent hangs with MPI_Ssend and MPI_Irecv

2020-07-25 Thread Joseph Schuchart via users

Hi Sean,

Thanks for the report! I have a few questions/suggestions:

1) What version of Open MPI are you using?
2) What is your network? It sounds like you are on an IB cluster using 
btl/openib (which is essentially discontinued). Can you try the Open MPI 
4.0.4 release with UCX instead of openib (configure with --without-verbs 
and --with-ucx)?
3) If that does not help, can you boil your code down to a minimum 
working example? That would make it easier for people to try to 
reproduce what happens.
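
In case it helps, a hypothetical stand-alone skeleton of the pattern from 
the pseudocode below (dummy payloads, a full communication matrix, and all 
names chosen here purely for illustration) could look like this:

```
#include <mpi.h>
#include <stdlib.h>

#define MSG_SIZE (1 << 20)   /* assumption: "large" messages, 1 Mi ints each */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int *sendbuf = malloc(MSG_SIZE * sizeof(int));
    int *recvbuf = malloc((size_t)nprocs * MSG_SIZE * sizeof(int));
    MPI_Request *reqs = malloc(nprocs * sizeof(MPI_Request));
    for (int k = 0; k < MSG_SIZE; k++) sendbuf[k] = rank;

    for (int iter = 0; iter < 10000; iter++) {
        int nreq = 0;
        /* post receives from every other rank (stands in for commatrix_recv) */
        for (int i = 0; i < nprocs; i++) {
            if (i == rank) continue;
            MPI_Irecv(recvbuf + (size_t)i * MSG_SIZE, MSG_SIZE, MPI_INT,
                      i, 0, MPI_COMM_WORLD, &reqs[nreq++]);
        }
        /* synchronous sends: higher ranks first, then lower, as in the pseudocode */
        for (int i = rank + 1; i < nprocs; i++)
            MPI_Ssend(sendbuf, MSG_SIZE, MPI_INT, i, 0, MPI_COMM_WORLD);
        for (int i = 0; i < rank; i++)
            MPI_Ssend(sendbuf, MSG_SIZE, MPI_INT, i, 0, MPI_COMM_WORLD);

        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
        MPI_Barrier(MPI_COMM_WORLD);
    }

    free(sendbuf); free(recvbuf); free(reqs);
    MPI_Finalize();
    return 0;
}
```

If something close to this (with your message sizes and communication 
matrix) reproduces the hang, it gives everyone something concrete to run.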


Cheers
Joseph

On 7/24/20 11:34 PM, Lewis,Sean via users wrote:

Hi all,

I am encountering a silent hang involving MPI_Ssend and MPI_Irecv. The 
subroutine in question is called by each processor and is structured 
similar to the pseudo code below. The subroutine is successfully called 
several thousand times before the silent hang behavior manifests and 
never resolves. The hang will occur in nearly (but not exactly) the same 
spot for bit-wise identical tests. During the hang, all MPI ranks will 
be at the Line 18 Barrier except for two. One will be waiting at Line 
17, waiting for its Irecv to complete, and the other at one of the Ssend 
Line 9 or 14. This suggests that a MPI_Irecv never completes and a 
processor is indefinitely blocked in the Ssend unable to complete the 
transfer.


I’ve found similar discussion of this kind of behavior on the OpenMPI 
mailing list: 
https://www.mail-archive.com/users@lists.open-mpi.org/msg19227.html 
ultimately resolving in setting the mca parameter btl_openib_flags to 
304 or 305 (default 310): 
https://www.mail-archive.com/users@lists.open-mpi.org/msg19277.html. I 
have seen some promising behavior by doing the same. As the mailer 
suggests, this implies a problem with the RDMA protocols in infiniband 
for large messages.


I wanted to breathe life back into this conversation as the silent hang 
issue is particularly debilitating and confusing to me. 
Increasing/decreasing the number of processors used does not seem to 
alleviate the issue, and using MPI_Send results in the same behavior; 
perhaps a message has exceeded a memory limit? I am running a test now 
that reports the individual message sizes but I previously implemented a 
switch to check for buffer size discrepancies which is not triggered. In 
the meantime, has anyone run into similar issues or have thoughts as to 
remedies for this behavior?


1:  call MPI_BARRIER(…)

2:  do i = 1,nprocs

3:   if(commatrix_recv(i) .gt. 0) then ! Identify which procs to 
receive from via predefined matrix


4: call Mpi_Irecv(…)

5:   endif

6:   enddo

7:   do j = mype+1,nproc

8:   if(commatrix_send(j) .gt. 0) then ! Identify which procs to 
send to via predefined matrix


9:     MPI_Ssend(…)

10: endif

11: enddo

12: do j = 1,mype

13:  if(commatrix_send(j) .gt. 0) then ! Identify which procs to 
send to via predefined matrix


14:    MPI_Ssend(…)

15: endif

16: enddo

17: call MPI_Waitall(…) ! Wait for all Irecv to complete

18: call MPI_Barrier(…)

Cluster information:

30 processors

Managed by slurm

OS: Red Hat v. 7.7

Thank you for help/advice you can provide,

Sean

*Sean C. Lewis*

Doctoral Candidate

Department of Physics

Drexel University



Re: [OMPI users] MPI test suite

2020-07-24 Thread Joseph Schuchart via users

You may want to look into MTT: https://github.com/open-mpi/mtt

Cheers
Joseph

On 7/23/20 8:28 PM, Zhang, Junchao via users wrote:

Hello,
   Does OMPI have a test suite that can let me validate MPI 
implementations from other vendors?


   Thanks
--Junchao Zhang





Re: [OMPI users] Vader - Where to Look for Shared Memory Use

2020-07-22 Thread Joseph Schuchart via users

Hi John,

Depending on your platform the default behavior of Open MPI is to mmap a 
shared backing file that is either located in a session directory under 
/dev/shm or under $TMPDIR (I believe under Linux it is /dev/shm). You 
will find a set of files there that are used to back shared memory. They 
should be deleted automatically at the end of a run.


What symptoms are you experiencing and on what platform?

Cheers
Joseph

On 7/22/20 10:15 AM, John Duffy via users wrote:

Hi

I’m trying to investigate an HPL Linpack scaling issue on a single node, 
increasing from 1 to 4 cores.

Regarding single node messages, I think I understand that Open-MPI will select 
the most efficient mechanism, which in this case I think should be vader shared 
memory.

But when I run Linpack, ipcs -m gives…

-- Shared Memory Segments 
keyshmid  owner  perms  bytes  nattch status


And, ipcs -u gives…

-- Messages Status 
allocated queues = 0
used headers = 0
used space = 0 bytes

-- Shared Memory Status 
segments allocated 0
pages allocated 0
pages resident  0
pages swapped   0
Swap performance: 0 attempts 0 successes

-- Semaphore Status 
used arrays = 0
allocated semaphores = 0


Am I looking in the wrong place to see how/if vader is using shared memory? I’m 
wondering if a slower mechanism is being used.

My ompi_info includes...

MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.0.3)
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.3)
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.3)
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.3)


Best wishes



Re: [OMPI users] Coordinating (non-overlapping) local stores with remote puts form using passive RMA synchronization

2020-06-02 Thread Joseph Schuchart via users

Hi Stephen,

Let me try to answer your questions inline (I don't have extensive 
experience with the separate model and from my experience most 
implementations support the unified model, with some exceptions):


On 5/31/20 1:31 AM, Stephen Guzik via users wrote:

Hi,

I'm trying to get a better understanding of coordinating 
(non-overlapping) local stores with remote puts when using passive 
synchronization for RMA.  I understand that the window should be locked 
for a local store, but can it be a shared lock?


Yes. There is no reason why that cannot be a shared lock.

In my example, each 
process retrieves and increments an index (indexBuf and indexWin) from a 
target process and then stores it's rank into an array (dataBuf and 
dataWin) at that index on the target.  If the target is local, a local 
store is attempted:


/* indexWin on indexBuf, dataWin on dataBuf */
std::vector<int> myvals(numProc);
const int one = 1;   /* increment for the fetch-and-op (name reconstructed) */
MPI_Win_lock_all(0, indexWin);
MPI_Win_lock_all(0, dataWin);
for (int tgtProc = 0; tgtProc != numProc; ++tgtProc)
  {
    MPI_Fetch_and_op(&one, &myvals[tgtProc], MPI_INT, tgtProc, 0, MPI_SUM, indexWin);
    MPI_Win_flush_local(tgtProc, indexWin);
    // Put our rank into the right location of the target
    if (tgtProc == procID)
      {
        dataBuf[myvals[procID]] = procID;
      }
    else
      {
        MPI_Put(&procID, 1, MPI_INT, tgtProc, myvals[tgtProc], 1, MPI_INT, dataWin);
      }
  }
MPI_Win_flush_all(dataWin);  /* Force completion and time synchronization */
MPI_Barrier(MPI_COMM_WORLD);
/* Proceed with local loads and unlock windows later */

I believe this is valid for a unified memory model but would probably 
fail for a separate model (unless a separate model very cleverly merges 
a private and public window?)  Is this understanding correct?  And if I 
instead use MPI_Put for the local write, then it should be valid for 
both memory models?


Yes, if you use RMA operations even on local memory it is valid for both 
memory models.


The MPI standard on page 455 (S3) states that "a store to process memory 
to a location in a window must not start once a put or accumulate update 
to that target window has started, until the put or accumulate update 
becomes visible in process memory." So there is no clever merging and it 
is up to the user to ensure that there are no puts and stores happening 
at the same time.




Another approach is specific locks.  I don't like this because it seems 
there are excessive synchronizations.  But if I really want to mix local 
stores and remote puts, is this the only way using locks?


/* indexWin on indexBuf, dataWin on dataBuf */
std::vector<int> myvals(numProc);
const int one = 1;   /* increment for the fetch-and-op (name reconstructed) */
for (int tgtProc = 0; tgtProc != numProc; ++tgtProc)
  {
    MPI_Win_lock(MPI_LOCK_SHARED, tgtProc, 0, indexWin);
    MPI_Fetch_and_op(&one, &myvals[tgtProc], MPI_INT, tgtProc, 0, MPI_SUM, indexWin);
    MPI_Win_unlock(tgtProc, indexWin);
    // Put our rank into the right location of the target
    if (tgtProc == procID)
      {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, tgtProc, 0, dataWin);
        dataBuf[myvals[procID]] = procID;
        MPI_Win_unlock(tgtProc, dataWin);  /*(A)*/
      }
    else
      {
        MPI_Win_lock(MPI_LOCK_SHARED, tgtProc, 0, dataWin);
        MPI_Put(&procID, 1, MPI_INT, tgtProc, myvals[tgtProc], 1, MPI_INT, dataWin);
        MPI_Win_unlock(tgtProc, dataWin);
      }
  }
/* Proceed with local loads */

I believe this is also valid for both memory models?  An unlock must 
have followed the last access to the local window, before the exclusive 
lock is gained.  That should have synchronized the windows and another 
synchronization should happen at (A).  Is that understanding correct? 


That is correct for both memory models, yes. It is likely to be slower 
because locking and unlocking involves some effort. You are better off 
using put instead.


If you really want to use local stores you can check for the 
MPI_WIN_UNIFIED attribute and fall-back to using puts only for the 
separate model.


> If so, how does one ever get into a situation where MPI_Win_sync must 
be used?


You can think of a synchronization scheme where each process takes a 
shared lock on a window, stores data to a local location, calls 
MPI_Win_sync and signals to other processes that the data is now 
available, e.g., through a barrier or a send. In that case processes 
keep the lock and use some non-RMA synchronization instead.
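
For illustration only, in terms of the example above (all processes already 
holding a shared lock on dataWin via MPI_Win_lock_all), that scheme would 
look roughly like:

```
dataBuf[myvals[procID]] = procID;   /* local store into the window memory         */
MPI_Win_sync(dataWin);              /* reconcile private and public window copies */
MPI_Barrier(MPI_COMM_WORLD);        /* non-RMA signal: "my update is visible now" */
/* other processes may now read that location, e.g. with MPI_Get + MPI_Win_flush  */
```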




Final question.  In the first example, let's say there is a lot of 
computation in the loop and I want the MPI_Puts to immediately make 
progress.  Would it be sensible to follow the MPI_Put with a 
MPI_Win_flush_local to get things moving?  Or is it best to avoid any 
unnecessary synchronizations?


That is highly implementation-specific. Some implementations may buffer 
the puts and delay the transfer to the flush, some may initiate it 
immediately, and some may treat a local flush similar to a regular 
flush. I would not make any assumptions about the underlying 

Re: [OMPI users] RMA in openmpi

2020-04-27 Thread Joseph Schuchart via users

Hi Claire,

You cannot use MPI_Get (or any other RMA communication routine) on a 
window for which no access epoch has been started. MPI_Win_fence starts 
an active target access epoch, MPI_Win_lock[_all] start a passive target 
access epoch. Window locks are synchronizing in the sense that they 
provide a means for mutual exclusion if an exclusive lock is involved (a 
process holding a shared window lock allows for other processes to 
acquire shared locks but prevents them from taking an exclusive lock, 
and vice versa).


One common strategy is to call MPI_Win_lock_all on all processes to let 
all processes acquire a shared lock, which they hold until the end of 
the application run. Communication is then done using a combination of 
MPI_Get/MPI_Put/accumulate functions and flushes. As said earlier, you 
likely will need to take care of synchronization among the processes if 
they also modify data in the window.


Cheers
Joseph

On 4/27/20 12:14 PM, Claire Cashmore wrote:

Hi Joseph

Thank you for your reply. From what I had been reading I thought they were both called 
"synchronization calls" just that one was passive (lock) and one was active 
(fence), sorry if I've got confused!
So I'm asking: do I need either MPI_Win_fence or MPI_Win_lock/unlock in order to 
use one-sided calls, and is it not possible to use one-sided communication 
without them? So just a stand-alone MPI_Get, without the other calls before and 
after? It seems not from what you are saying, but I just wanted to confirm.

Thanks again

Claire

On 27/04/2020, 07:50, "Joseph Schuchart via users"  
wrote:

 Claire,

  > Is it possible to use the one-sided communication without combining
 it with synchronization calls?

 What exactly do you mean by "synchronization calls"? MPI_Win_fence is
 indeed synchronizing (basically flush+barrier) but MPI_Win_lock (and the
 passive target synchronization interface at large) is not. It does incur
 some overhead because the lock has to be taken somehow at some point.
 However, it does not require a matching call at the target to complete.

 You can lock a window using a (shared or exclusive) lock, initiate RMA
 operations, flush them to wait for their completion, and initiate the
 next set of RMA operations to flush later. None of these calls are
 synchronizing. You will have to perform your own synchronization at some
 point though to make sure processes read consistent data.

 HTH!
 Joseph


 On 4/24/20 5:34 PM, Claire Cashmore via users wrote:
 > Hello
 >
 > I was wondering if someone could help me with a question.
 >
 > When using RMA is there a requirement to use some type of
 > synchronization? When using one-sided communication such as MPI_Get the
 > code will only run when I combine it with MPI_Win_fence or
 > MPI_Win_lock/unlock. I do not want to use MPI_Win_fence as I’m using the
 > one-sided communication to allow some communication when processes are
 > not synchronised, so this defeats the point. I could use
 > MPI_Win_lock/unlock, however someone I’ve spoken to has said that I
 > should be able to use RMA without any synchronization calls, if so then
 > I would prefer to do this to reduce any overheads using MPI_Win_lock
 > every time I use the one-sided communication may produce.
 >
 > Is it possible to use the one-sided communication without combining it
 > with synchronization calls?
 >
 > (It doesn’t seem to matter what version of openmpi I use).
 >
 > Thank you
 >
 > Claire
 >



Re: [OMPI users] RMA in openmpi

2020-04-27 Thread Joseph Schuchart via users

Claire,

> Is it possible to use the one-sided communication without combining 
it with synchronization calls?


What exactly do you mean by "synchronization calls"? MPI_Win_fence is 
indeed synchronizing (basically flush+barrier) but MPI_Win_lock (and the 
passive target synchronization interface at large) is not. It does incur 
some overhead because the lock has to be taken somehow at some point. 
However, it does not require a matching call at the target to complete.


You can lock a window using a (shared or exclusive) lock, initiate RMA 
operations, flush them to wait for their completion, and initiate the 
next set of RMA operations to flush later. None of these calls are 
synchronizing. You will have to perform your own synchronization at some 
point though to make sure processes read consistent data.


HTH!
Joseph


On 4/24/20 5:34 PM, Claire Cashmore via users wrote:

Hello

I was wondering if someone could help me with a question.

When using RMA is there a requirement to use some type of 
synchronization? When using one-sided communication such as MPI_Get the 
code will only run when I combine it with MPI_Win_fence or 
MPI_Win_lock/unlock. I do not want to use MPI_Win_fence as I’m using the 
one-sided communication to allow some communication when processes are 
not synchronised, so this defeats the point. I could use 
MPI_Win_lock/unlock, however someone I’ve spoken to has said that I 
should be able to use RMA without any synchronization calls, if so then 
I would prefer to do this to reduce any overheads using MPI_Win_lock 
every time I use the one-sided communication may produce.


Is it possible to use the one-sided communication without combining it 
with synchronization calls?


(It doesn’t seem to matter what version of openmpi I use).

Thank you

Claire



[OMPI users] Question about UCX progress throttling

2020-02-07 Thread Joseph Schuchart via users
Today I came across the two MCA parameters osc_ucx_progress_iterations 
and pml_ucx_progress_iterations in Open MPI. My interpretation of the 
description is that in a loop such as below, progress in UCX is only 
triggered every 100 iterations (assuming opal_progress is only called 
once per MPI_Test call):


```
int flag = 0;
MPI_Request req;

while (!flag) {
  do_something_else();
  MPI_Test(&req, &flag);
}
```

Is that assumption correct? What is the reason behind this throttling? 
In combination with TAMPI, it appears that setting them to 1 yields a 
significant speedup. Is it safe to always set them to 1?


Thanks
Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de


Re: [OMPI users] mpirun --output-filename behavior

2019-10-31 Thread Joseph Schuchart via users

On 10/30/19 2:06 AM, Jeff Squyres (jsquyres) via users wrote:


Oh, did the prior behavior *only* output to the file and not to 
stdout/stderr?  Huh.


I guess a workaround for that would be:

     mpirun  ... > /dev/null


Just to throw in my $0.02: I recently found that the output to 
stdout/stderr may not be desirable: in an application that writes a lot 
of log data to stderr on all ranks, stdout was significantly slower than 
the files I redirected stdio to (I ended up seeing the application 
complete in the file output while the terminal wasn't even halfway 
through). Redirecting stderr to /dev/null as Jeff suggests does not help 
much because the output first has to be sent to the head node.


Things got even worse when I tried to use the stdout redirection with 
DDT: it barfed at me for doing pipe redirection in the command 
specification! The DDT terminal is just really slow and made the whole 
exercise worthless.


Point to make: it would be nice to have an option to suppress the output 
on stdout and/or stderr when output redirection to file is requested. In 
my case, having stdout still visible on the terminal is desirable but 
having a way to suppress output of stderr to the terminal would be 
immensely helpful.


Joseph



--
Jeff Squyres
jsquy...@cisco.com 



[OMPI users] CPC only supported when the first QP is a PP QP?

2019-08-05 Thread Joseph Schuchart via users
I'm trying to run an MPI RMA application on an IB cluster and find that 
Open MPI is using the pt2pt rdma component instead of openib (or UCX). I 
tried getting some logs from Open MPI (current 3.1.x git):


```
$ mpirun -n 2 --mca btl_base_verbose 100 --mca osc_base_verbose 100 
--mca osc_rdma_verbose 100 ./a.out
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: 
components_open: opening osc components
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: 
components_open: found loaded component sm
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: 
components_open: component sm open function successful
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: 
components_open: found loaded component monitoring
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: 
components_open: found loaded component pt2pt
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: 
components_open: found loaded component rdma
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] rdmacm CPC only supported 
when the first QP is a PP QP; skipped
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] openib BTL: rdmacm CPC 
unavailable for use on mlx5_0:1; skipped
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] [rank=0] openib: using 
port mlx5_0:1
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] select: init of component 
openib returned success
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] select: initializing btl 
component tcp
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] btl: tcp: Searching for 
exclude address+prefix: 127.0.0.1 / 8
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] btl: tcp: Found match: 
127.0.0.1 (lo)

```

Is there any information on what makes "rdmacm CPC unavailable for use"? 
I cannot make much sense of "rdmacm CPC only supported when the first QP 
is a PP QP"... Is this a configuration problem of the system? A problem 
with the software stack?


If I try the same using Open MPI 4.0.x it reports:
```
[taurusi6607.taurus.hrsk.tu-dresden.de:21681] Process is not bound: 
distance to device is 0.00

--
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   taurusi6606
  Local device: mlx5_0
--
[taurusi6606.taurus.hrsk.tu-dresden.de:09069] select: init of component 
openib returned failure

```

The message about rdmacm does not show up.

The system has mlx5 devices:

```
$ ~/opt/openmpi-v3.1.x/bin/mpirun -n 2 ibv_devices
device node GUID
--  
mlx5_0  08003800013c7507
device node GUID
--  
mlx5_0  08003800013c773b
```

Any help would be much appreciated!

Thanks,
Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Joseph Schuchart via users

Noam,

Another idea: check for stale files in /dev/shm/ (or a subdirectory that 
looks like it belongs to UCX/OpenMPI) and SysV shared memory using `ipcs 
-m`.
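
For context, here is a toy C program (not Open MPI code, and the name `/demo_segment` is made up) showing why such files can outlive a job: a POSIX shared-memory object stays in /dev/shm until it is explicitly unlinked, even after the creating process exits.

```
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* Creates /dev/shm/demo_segment (name is arbitrary; link with -lrt on older glibc). */
    int fd = shm_open("/demo_segment", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, 4096) != 0) { perror("ftruncate"); return 1; }
    close(fd);
    /* No shm_unlink("/demo_segment") here: the file stays in /dev/shm after
     * this process exits, much like stale segments left behind by a crashed
     * or killed MPI job. */
    return 0;
}
```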


Joseph

On 6/20/19 3:31 PM, Noam Bernstein via users wrote:



On Jun 20, 2019, at 4:44 AM, Charles A Taylor wrote:


This looks a lot like a problem I had with OpenMPI 3.1.2.  I thought 
the fix was landed in 4.0.0 but you might
want to check the code to be sure there wasn’t a regression in 4.1.x. 
 Most of our codes are still running
3.1.2 so I haven’t built anything beyond 4.0.0 which definitely 
included the fix.


Unfortunately, 4.0.0 behaves the same.

One thing that I’m wondering if anyone familiar with the internals can 
explain is how you get a memory leak that isn’t freed when the program 
ends? Doesn’t that suggest that it’s something lower level, like maybe 
a kernel issue?


Noam


U.S. NAVAL RESEARCH LABORATORY

Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil



Re: [OMPI users] Latencies of atomic operations on high-performance networks

2019-05-09 Thread Joseph Schuchart via users

Benson,

I just gave 4.0.1 a shot and the behavior is the same (the reason I'm 
stuck with 3.1.2 is a regression with `osc_rdma_acc_single_intrinsic` on 
4.0 [1]).


The IB cluster has both Mellanox ConnectX-3 (w/ Haswell CPU) and 
ConnectX-4 (w/ Skylake CPU) nodes, the effect is visible on both node types.


Joseph

[1] https://github.com/open-mpi/ompi/issues/6536

On 5/9/19 9:10 AM, Benson Muite via users wrote:

Hi,

Have you tried anything with OpenMPI 4.0.1?

What are the specifications of the Infiniband system you are using?

Benson

On 5/9/19 9:37 AM, Joseph Schuchart via users wrote:

Nathan,

Over the last couple of weeks I made some more interesting 
observations regarding the latencies of accumulate operations on both 
Aries and InfiniBand systems:


1) There seems to be a significant difference between 64bit and 32bit 
operations: on Aries, the average latency for compare-exchange on 
64bit values takes about 1.8us while on 32bit values it's at 3.9us, a 
factor of >2x. On the IB cluster, all of fetch-and-op, 
compare-exchange, and accumulate show a similar difference between 32 
and 64bit. There are no differences between 32bit and 64bit puts and 
gets on these systems.


2) On both systems, the latency for a single-value atomic load using 
MPI_Fetch_and_op + MPI_NO_OP is 2x that of MPI_Fetch_and_op + MPI_SUM 
on 64bit values, roughly matching the latency of 32bit 
compare-exchange operations.


All measurements were done using Open MPI 3.1.2 with 
OMPI_MCA_osc_rdma_acc_single_intrinsic=true. Is that behavior expected 
as well?


Thanks,
Joseph


On 11/6/18 6:13 PM, Nathan Hjelm via users wrote:


All of this is completely expected. Due to the requirements of the 
standard it is difficult to make use of network atomics even for 
MPI_Compare_and_swap (MPI_Accumulate and MPI_Get_accumulate spoil the 
party). If you want MPI_Fetch_and_op to be fast set this MCA parameter:


osc_rdma_acc_single_intrinsic=true

Shared lock is slower than an exclusive lock because there is an 
extra lock step as part of the accumulate (it isn't needed if there 
is an exclusive lock). When setting the above parameter you are 
telling the implementation that you will only be using a single count 
and we can optimize that with the hardware. The RMA working group is 
working on an info key that will essentially do the same thing.
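
To make the lock-mode distinction concrete, here is a minimal sketch (not part of the original message; `win` and `target` are assumed to be an existing RMA window and a valid rank):

```
#include <mpi.h>
#include <stdint.h>

/* Passive-target fetch-and-add under an exclusive lock: only this origin
 * may access the target's window during the epoch. */
void fetch_add_exclusive(MPI_Win win, int target)
{
    uint64_t val = 1, res;
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
    MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
    MPI_Win_unlock(target, win);
}

/* The same operation under a shared lock: concurrent origins are allowed,
 * which is where the extra internal lock step mentioned above comes in. */
void fetch_add_shared(MPI_Win win, int target)
{
    uint64_t val = 1, res;
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
    MPI_Win_unlock(target, win);
}
```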


Note the above parameter won't help you with IB if you are using UCX 
unless you set this (master only right now):


btl_uct_transports=dc_mlx5

btl=self,vader,uct

osc=^ucx


Though there may be a way to get osc/ucx to enable the same sort of 
optimization. I don't know.



-Nathan


On Nov 06, 2018, at 09:38 AM, Joseph Schuchart  
wrote:



All,

I am currently experimenting with MPI atomic operations and wanted to
share some interesting results I am observing. The numbers below are
measurements from both an IB-based cluster and our Cray XC40. The
benchmarks look like the following snippet:

```
if (rank == 1) {
  uint64_t res, val;
  for (size_t i = 0; i < NUM_REPS; ++i) {
    MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
    MPI_Win_flush(target, win);
  }
}
MPI_Barrier(MPI_COMM_WORLD);
```
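
The snippet uses `win`, `NUM_REPS`, and `target` without showing their setup; a self-contained sketch of how the surrounding code might look (an assumption for illustration, not the original benchmark) is:

```
#include <mpi.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_REPS 1000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    const int target = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One 64-bit counter per rank, exposed through an RMA window. */
    uint64_t *baseptr;
    MPI_Win win;
    MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &baseptr, &win);
    *baseptr = 0;

    MPI_Win_lock_all(0, win);        /* shared lock on all targets */
    MPI_Barrier(MPI_COMM_WORLD);     /* ensure counters are initialized */

    if (rank == 1) {
        uint64_t res, val = 1;
        for (size_t i = 0; i < NUM_REPS; ++i) {
            MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
            MPI_Win_flush(target, win);
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```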

Only rank 1 performs atomic operations, rank 0 waits in a barrier (I
have tried to confirm that the operations are done in hardware by
letting rank 0 sleep for a while and ensuring that communication
progresses). Of particular interest for my use-case is fetch_op but I am
including other operations here nevertheless:

* Linux Cluster, IB QDR *
average of 10 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 4.323384us
compare_exchange: 2.035905us
accumulate: 4.326358us
get_accumulate: 4.334831us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.438080us
compare_exchange: 2.398836us
accumulate: 2.435378us
get_accumulate: 2.448347us

Shared lock, MPI_UINT32_T:
fetch_op: 6.819977us
compare_exchange: 4.551417us
accumulate: 6.807766us
get_accumulate: 6.817602us

Shared lock, MPI_UINT64_T:
fetch_op: 4.954860us
compare_exchange: 2.399373us
accumulate: 4.965702us
get_accumulate: 4.977876us

There are two interesting observations:
a) operations on 64bit operands generally seem to have lower latencies
than operations on 32bit
b) Using an exclusive lock leads to lower latencies

Overall, there is a factor of almost 3 between SharedLock+uint32_t and
ExclusiveLock+uint64_t for fetch_and_op, accumulate, and get_accumulate
(compare_exchange seems to be somewhat of an outlier).

* Cray XC40, Aries *
average of 10 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 2.011794us
compare_exchange: 1.740825us
accumulate: 1.795500us
get_accumulate: 1.985409us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.017172us
compare_exchange: 1.846202us
accumulate: 1.812578us
get_accumulate: 2.005541us

Shared lock, MPI_UINT32_T:
fetch_op: 5.380455us
compare_exchange: 5.164458us
accumulate: 5.230184us
get_accumulate: 5.399722us

Shared lock, MPI_UINT64_T:
fetch_op: 5.415230us
compare_exchange: 1.855840us
accumulate: 5.212632us
get_accumulate: 5.396110us

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2019-05-09 Thread Joseph Schuchart via users

Nathan,

Over the last couple of weeks I made some more interesting observations 
regarding the latencies of accumulate operations on both Aries and 
InfiniBand systems:


1) There seems to be a significant difference between 64bit and 32bit 
operations: on Aries, the average latency for compare-exchange on 64bit 
values takes about 1.8us while on 32bit values it's at 3.9us, a factor 
of >2x. On the IB cluster, all of fetch-and-op, compare-exchange, and 
accumulate show a similar difference between 32 and 64bit. There are no 
differences between 32bit and 64bit puts and gets on these systems.


2) On both systems, the latency for a single-value atomic load using 
MPI_Fetch_and_op + MPI_NO_OP is 2x that of MPI_Fetch_and_op + MPI_SUM on 
64bit values, roughly matching the latency of 32bit compare-exchange 
operations.
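
For reference, a sketch of the two variants compared in 2), assuming an existing window `win`, a passive-target epoch already open, and a valid `target` rank; MPI_NO_OP turns MPI_Fetch_and_op into an atomic read, MPI_SUM into an atomic fetch-and-add:

```
#include <mpi.h>
#include <stdint.h>

/* Atomic 64-bit load: the origin buffer is not accessed with MPI_NO_OP. */
uint64_t atomic_load_u64(MPI_Win win, int target)
{
    uint64_t res, unused = 0;
    MPI_Fetch_and_op(&unused, &res, MPI_UINT64_T, target, 0, MPI_NO_OP, win);
    MPI_Win_flush(target, win);
    return res;
}

/* Atomic 64-bit fetch-and-add. */
uint64_t atomic_fetch_add_u64(MPI_Win win, int target, uint64_t val)
{
    uint64_t res;
    MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
    MPI_Win_flush(target, win);
    return res;
}
```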


All measurements were done using Open MPI 3.1.2 with 
OMPI_MCA_osc_rdma_acc_single_intrinsic=true. Is that behavior expected 
as well?


Thanks,
Joseph


On 11/6/18 6:13 PM, Nathan Hjelm via users wrote:


All of this is completely expected. Due to the requirements of the 
standard it is difficult to make use of network atomics even for 
MPI_Compare_and_swap (MPI_Accumulate and MPI_Get_accumulate spoil the 
party). If you want MPI_Fetch_and_op to be fast set this MCA parameter:


osc_rdma_acc_single_intrinsic=true

Shared lock is slower than an exclusive lock because there is an extra 
lock step as part of the accumulate (it isn't needed if there is an 
exclusive lock). When setting the above parameter you are telling the 
implementation that you will only be using a single count and we can 
optimize that with the hardware. The RMA working group is working on an 
info key that will essentially do the same thing.


Note the above parameter won't help you with IB if you are using UCX 
unless you set this (master only right now):


btl_uct_transports=dc_mlx5

btl=self,vader,uct

osc=^ucx


Though there may be a way to get osc/ucx to enable the same sort of 
optimization. I don't know.



-Nathan


On Nov 06, 2018, at 09:38 AM, Joseph Schuchart  wrote:


All,

I am currently experimenting with MPI atomic operations and wanted to
share some interesting results I am observing. The numbers below are
measurements from both an IB-based cluster and our Cray XC40. The
benchmarks look like the following snippet:

```
if (rank == 1) {
  uint64_t res, val;
  for (size_t i = 0; i < NUM_REPS; ++i) {
    MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
    MPI_Win_flush(target, win);
  }
}
MPI_Barrier(MPI_COMM_WORLD);
```

Only rank 1 performs atomic operations, rank 0 waits in a barrier (I
have tried to confirm that the operations are done in hardware by
letting rank 0 sleep for a while and ensuring that communication
progresses). Of particular interest for my use-case is fetch_op but I am
including other operations here nevertheless:

* Linux Cluster, IB QDR *
average of 10 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 4.323384us
compare_exchange: 2.035905us
accumulate: 4.326358us
get_accumulate: 4.334831us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.438080us
compare_exchange: 2.398836us
accumulate: 2.435378us
get_accumulate: 2.448347us

Shared lock, MPI_UINT32_T:
fetch_op: 6.819977us
compare_exchange: 4.551417us
accumulate: 6.807766us
get_accumulate: 6.817602us

Shared lock, MPI_UINT64_T:
fetch_op: 4.954860us
compare_exchange: 2.399373us
accumulate: 4.965702us
get_accumulate: 4.977876us

There are two interesting observations:
a) operations on 64bit operands generally seem to have lower latencies
than operations on 32bit
b) Using an exclusive lock leads to lower latencies

Overall, there is a factor of almost 3 between SharedLock+uint32_t and
ExclusiveLock+uint64_t for fetch_and_op, accumulate, and get_accumulate
(compare_exchange seems to be somewhat of an outlier).

* Cray XC40, Aries *
average of 10 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 2.011794us
compare_exchange: 1.740825us
accumulate: 1.795500us
get_accumulate: 1.985409us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.017172us
compare_exchange: 1.846202us
accumulate: 1.812578us
get_accumulate: 2.005541us

Shared lock, MPI_UINT32_T:
fetch_op: 5.380455us
compare_exchange: 5.164458us
accumulate: 5.230184us
get_accumulate: 5.399722us

Shared lock, MPI_UINT64_T:
fetch_op: 5.415230us
compare_exchange: 1.855840us
accumulate: 5.212632us
get_accumulate: 5.396110us


The difference between exclusive and shared lock is about the same as
with IB and the latencies for 32bit vs 64bit are roughly the same
(except for compare_exchange, it seems).

So my question is: is this to be expected? Is the higher latency when
using a shared lock caused by an internal lock being acquired because
the hardware operations are not actually atomic?

I'd be grateful for any insight on this.

Cheers,
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: