Re: [OMPI users] Not getting zero-copy with custom datatype

2024-04-29 Thread George Bosilca via users
Like a vector where the stride and the blocklen are the same, so the blocks
sit back to back. There is an optimizer in MPI_Type_commit that tries to
reshape the datatype description so it represents the same memory layout
but requires fewer memcpy operations during pack/unpack.
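
For illustration, here is a minimal sketch (not from the original thread) of
such a type: a vector whose stride equals its blocklength describes a
gap-free layout, and comparing the type's size with its extent is a quick
way to verify there are no holes. Whether the zero-copy path is then taken
still depends on Open MPI's internal selection logic.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Datatype t;
    /* 4 blocks of 8 doubles with stride == blocklen: no gaps between blocks */
    MPI_Type_vector(4, 8, 8, MPI_DOUBLE, &t);
    MPI_Type_commit(&t);  /* the optimizer can collapse this into 32 contiguous doubles */

    int size;
    MPI_Aint lb, extent;
    MPI_Type_size(t, &size);
    MPI_Type_get_extent(t, &lb, &extent);
    /* size == extent (256 bytes here) hints that the described layout has no holes */
    printf("size=%d extent=%ld\n", size, (long)extent);

    MPI_Type_free(&t);
    MPI_Finalize();
    return 0;
}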

What you are asking for can be done, but it is relatively complex, and the
benefit is not obvious, even though it sounds self-evident that zero copy
gives you better bandwidth. One of the peers will need to expose the entire
span of the memory layout to the other process (this is usually an
expensive operation), and will also need to provide the peer with the
description of the datatype (adding extra latency to the operation). With
this info the peer can then compute the union of memcpy operations between
its local datatype and the remote datatype, and issue them all, resulting
in a zero-copy communication.

  George.


On Fri, Apr 26, 2024 at 4:21 AM Pascal Boeschoten via users <
users@lists.open-mpi.org> wrote:

> Hello George,
>
> Thank you, that's good to know. I expected the datatype to be enough of a
> description of the memory layout, since it's all on the same machine.
> Would you be able to clarify what you mean with "can be seen as contiguous
> (even if described otherwise)"? In what way could it be described
> otherwise, but still be seen as contiguous?
>
> Thanks,
> Pascal Boeschoten
>
> On Tue, 23 Apr 2024 at 16:05, George Bosilca  wrote:
>
>> zero copy does not work with non-contiguous datatypes (it would require
>> both processes to know the memory layout used by the peer). As long as the
>> memory layout described by the type can be seen as contiguous (even if
>> described otherwise), it should work just fine.
>>
>>   George.
>>
>> On Tue, Apr 23, 2024 at 10:02 AM Pascal Boeschoten via users <
>> users@lists.open-mpi.org> wrote:
>>
>>> Hello,
>>>
>>> I'm using a custom datatype created through MPI_Type_create_struct() to
>>> send data with a dynamic structure to another process on the same node over
>>> shared memory, and noticed it's much slower than expected.
>>>
>>> I ran a profile, and it looks like it's not using CMA zero-copy, falling
>>> back to using opal_generic_simple_pack()/opal_generic_simple_unpack().
>>> Simpler datatypes do seem to use zero-copy, using
>>> mca_btl_vader_get_cma(), so I don't think it's a configuration or system
>>> issue.
>>>
>>> I suspect it's because the struct datatype is not contiguous, i.e. the
>>> blocks of the struct have gaps between them.
>>> Is anyone able to confirm whether zero-copy with an MPI struct requires
>>> a contiguous data structure, and whether it has other requirements like the
>>> displacements being in ascending order, having homogeneous block
>>> datatypes/lengths, etc?
>>>
>>> I'm using OpenMPI 4.1.6.
>>>
>>> Thanks,
>>> Pascal Boeschoten
>>>
>>


Re: [OMPI users] Not getting zero-copy with custom datatype

2024-04-23 Thread George Bosilca via users
zero copy does not work with non-contiguous datatypes (it would require
both processes to know the memory layout used by the peer). As long as the
memory layout described by the type can be seen as contiguous (even if
described otherwise), it should work just fine.

  George.

On Tue, Apr 23, 2024 at 10:02 AM Pascal Boeschoten via users <
users@lists.open-mpi.org> wrote:

> Hello,
>
> I'm using a custom datatype created through MPI_Type_create_struct() to
> send data with a dynamic structure to another process on the same node over
> shared memory, and noticed it's much slower than expected.
>
> I ran a profile, and it looks like it's not using CMA zero-copy, falling
> back to using opal_generic_simple_pack()/opal_generic_simple_unpack().
> Simpler datatypes do seem to use zero-copy, using mca_btl_vader_get_cma(),
> so I don't think it's a configuration or system issue.
>
> I suspect it's because the struct datatype is not contiguous, i.e. the
> blocks of the struct have gaps between them.
> Is anyone able to confirm whether zero-copy with an MPI struct requires a
> contiguous data structure, and whether it has other requirements like the
> displacements being in ascending order, having homogeneous block
> datatypes/lengths, etc?
>
> I'm using OpenMPI 4.1.6.
>
> Thanks,
> Pascal Boeschoten
>


Re: [OMPI users] ULFM only works on a single node???

2024-03-24 Thread George Bosilca via users
All the examples work for me using ULFM (ge87f595) compiled with
minimalistic options:
'--prefix=XXX --enable-picky --enable-debug --disable-heterogeneous
--enable-contrib-no-build=vt --enable-mpirun-prefix-by-default
--enable-mpi-ext=ftmpi --with-ft=mpi --with-pmi'.

I run over IPoIB, so I select the sm, self, and tcp BTLs and the OB1 PML.
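
For reference, forcing that selection on the command line looks roughly like
the following (a sketch; ./any_ft_example stands for whichever ULFM example
you are running):

  mpirun --mca pml ob1 --mca btl sm,self,tcp -np 4 ./any_ft_example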

  George.


On Sat, Mar 23, 2024 at 6:33 PM Dean Anderson via users <
users@lists.open-mpi.org> wrote:

> If someone could take a look at
> https://github.com/open-mpi/ompi/issues/11404
> and provide some guidance or a work around, I’d appreciate it.
>
> The SC-22 Tutorials work just fine, but only on a single node.  If you
> arrange multiple nodes, it hangs in MPI_Finalize.
>
> I attended the SC22 Tutorial and it was not my impression that ULFM only
> worked if your tasks were all on a single node.
>
>
>
> Thanks!!
>
>
> Sent from my iPhone
>


Re: [OMPI users] Homebrew-installed OpenMPI 5.0.1 can't run a simple test program

2024-02-05 Thread George Bosilca via users
>>>   Internal debug support: no
>>>   MPI interface warnings: yes
>>>  MPI parameter check: runtime
>>> Memory profiling support: no
>>> Memory debugging support: no
>>>   dl support: yes
>>>Heterogeneous support: no
>>>MPI_WTIME support: native
>>>  Symbol vis. support: yes
>>>Host topology support: yes
>>> IPv6 support: yes
>>>   MPI extensions: affinity, cuda, ftmpi, rocm, shortfloat
>>>  Fault Tolerance support: yes
>>>   FT MPI support: yes
>>>   MPI_MAX_PROCESSOR_NAME: 256
>>> MPI_MAX_ERROR_STRING: 256
>>>  MPI_MAX_OBJECT_NAME: 64
>>> MPI_MAX_INFO_KEY: 36
>>> MPI_MAX_INFO_VAL: 256
>>>MPI_MAX_PORT_NAME: 1024
>>>   MPI_MAX_DATAREP_STRING: 128
>>>  MCA accelerator: null (MCA v2.1.0, API v1.0.0, Component v5.0.1)
>>>MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component
>>> v5.0.1)
>>>MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component
>>> v5.0.1)
>>>MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component
>>> v5.0.1)
>>>  MCA btl: self (MCA v2.1.0, API v3.3.0, Component v5.0.1)
>>>  MCA btl: sm (MCA v2.1.0, API v3.3.0, Component v5.0.1)
>>>  MCA btl: tcp (MCA v2.1.0, API v3.3.0, Component v5.0.1)
>>>   MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component
>>> v5.0.1)
>>>   MCA if: bsdx_ipv6 (MCA v2.1.0, API v2.0.0, Component
>>>   v5.0.1)
>>>   MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
>>>   v5.0.1)
>>>  MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v5.0.1)
>>>  MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component
>>> v5.0.1)
>>>MCA mpool: hugepage (MCA v2.1.0, API v3.1.0, Component
>>> v5.0.1)
>>>  MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
>>>   v5.0.1)
>>>   MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component
>>> v5.0.1)
>>>MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component
>>> v5.0.1)
>>>MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v5.0.1)
>>>MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component
>>> v5.0.1)
>>>MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v5.0.1)
>>>  MCA threads: pthreads (MCA v2.1.0, API v1.0.0, Component
>>> v5.0.1)
>>>MCA timer: darwin (MCA v2.1.0, API v2.0.0, Component
>>> v5.0.1)
>>>  MCA bml: r2 (MCA v2.1.0, API v2.1.0, Component v5.0.1)
>>> MCA coll: adapt (MCA v2.1.0, API v2.4.0, Component
>>> v5.0.1)
>>> MCA coll: basic (MCA v2.1.0, API v2.4.0, Component
>>> v5.0.1)
>>> MCA coll: han (MCA v2.1.0, API v2.4.0, Component v5.0.1)
>>> MCA coll: inter (MCA v2.1.0, API v2.4.0, Component
>>> v5.0.1)
>>> MCA coll: libnbc (MCA v2.1.0, API v2.4.0, Component
>>> v5.0.1)
>>> MCA coll: self (MCA v2.1.0, API v2.4.0, Component v5.0.1)
>>> MCA coll: sync (MCA v2.1.0, API v2.4.0, Component v5.0.1)
>>> MCA coll: tuned (MCA v2.1.0, API v2.4.0, Component
>>> v5.0.1)
>>> MCA coll: ftagree (MCA v2.1.0, API v2.4.0, Component
>>> v5.0.1)
>>> MCA coll: monitoring (MCA v2.1.0, API v2.4.0, Component
>>>   v5.0.1)
>>> MCA coll: sm (MCA v2.1.0, API v2.4.0, Component v5.0.1)
>>> MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component
>>> v5.0.1)
>>>MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component
>>> v5.0.1)
>>>MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
>>>   v5.0.1)
>>>MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
>>>   v5.0.1)
>>>MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component
>>> v5.0.1)
>>>   MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v5.0.1)
>>> MCA hook: comm_method (MCA v2.1.0, API v1.0.0, Component
>>>

Re: [OMPI users] Homebrew-installed OpenMPI 5.0.1 can't run a simple test program

2024-02-05 Thread George Bosilca via users
   MPI_MAX_ERROR_STRING: 256
>>  MPI_MAX_OBJECT_NAME: 64
>> MPI_MAX_INFO_KEY: 36
>> MPI_MAX_INFO_VAL: 256
>>MPI_MAX_PORT_NAME: 1024
>>   MPI_MAX_DATAREP_STRING: 128
>>  MCA accelerator: null (MCA v2.1.0, API v1.0.0, Component v5.0.1)
>>MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v5.0.1)
>>MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component
>> v5.0.1)
>>MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component
>> v5.0.1)
>>  MCA btl: self (MCA v2.1.0, API v3.3.0, Component v5.0.1)
>>  MCA btl: sm (MCA v2.1.0, API v3.3.0, Component v5.0.1)
>>  MCA btl: tcp (MCA v2.1.0, API v3.3.0, Component v5.0.1)
>>   MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component
>> v5.0.1)
>>   MCA if: bsdx_ipv6 (MCA v2.1.0, API v2.0.0, Component
>>   v5.0.1)
>>   MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
>>   v5.0.1)
>>  MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v5.0.1)
>>  MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component
>> v5.0.1)
>>MCA mpool: hugepage (MCA v2.1.0, API v3.1.0, Component
>> v5.0.1)
>>  MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
>>   v5.0.1)
>>   MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v5.0.1)
>>MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component
>> v5.0.1)
>>MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v5.0.1)
>>MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v5.0.1)
>>MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v5.0.1)
>>  MCA threads: pthreads (MCA v2.1.0, API v1.0.0, Component
>> v5.0.1)
>>MCA timer: darwin (MCA v2.1.0, API v2.0.0, Component
>> v5.0.1)
>>  MCA bml: r2 (MCA v2.1.0, API v2.1.0, Component v5.0.1)
>> MCA coll: adapt (MCA v2.1.0, API v2.4.0, Component v5.0.1)
>> MCA coll: basic (MCA v2.1.0, API v2.4.0, Component v5.0.1)
>> MCA coll: han (MCA v2.1.0, API v2.4.0, Component v5.0.1)
>> MCA coll: inter (MCA v2.1.0, API v2.4.0, Component v5.0.1)
>> MCA coll: libnbc (MCA v2.1.0, API v2.4.0, Component
>> v5.0.1)
>> MCA coll: self (MCA v2.1.0, API v2.4.0, Component v5.0.1)
>> MCA coll: sync (MCA v2.1.0, API v2.4.0, Component v5.0.1)
>> MCA coll: tuned (MCA v2.1.0, API v2.4.0, Component v5.0.1)
>> MCA coll: ftagree (MCA v2.1.0, API v2.4.0, Component
>> v5.0.1)
>> MCA coll: monitoring (MCA v2.1.0, API v2.4.0, Component
>>   v5.0.1)
>> MCA coll: sm (MCA v2.1.0, API v2.4.0, Component v5.0.1)
>> MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v5.0.1)
>>MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component
>> v5.0.1)
>>MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
>>   v5.0.1)
>>MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
>>   v5.0.1)
>>MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component
>> v5.0.1)
>>   MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v5.0.1)
>> MCA hook: comm_method (MCA v2.1.0, API v1.0.0, Component
>>   v5.0.1)
>>   MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v5.0.1)
>>   MCA io: romio341 (MCA v2.1.0, API v2.0.0, Component
>> v5.0.1)
>>  MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v5.0.1)
>>  MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component
>>   v5.0.1)
>>  MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v5.0.1)
>> MCA part: persist (MCA v2.1.0, API v4.0.0, Component
>> v5.0.1)
>>  MCA pml: cm (MCA v2.1.0, API v2.1.0, Component v5.0.1)
>>  MCA pml: monitoring (MCA v2.1.0, API v2.1.0, Component
>>   v5.0.1)
>>  MCA pml: ob1 (MCA v2.1.0, API v2.1.0, Component v5.0.1)
>>  MCA pml: v (MCA v2.1.0, API v2.1.0, Component v5.0.1)
>> MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
>>   

Re: [OMPI users] Homebrew-installed OpenMPI 5.0.1 can't run a simple test program

2024-02-05 Thread George Bosilca via users
OMPI seems unable to create a communication medium between your processes.
There are a few known issues on OSX; please read
https://github.com/open-mpi/ompi/issues/12273 for more info.

Can you provide the header of the ompi_info output? What I'm interested in
is the part about `Configure command line:`
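
For example (assuming ompi_info is the Homebrew-installed one in your PATH),
something like

  ompi_info | grep -i -A1 'configure command'

should print that line.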

George.


On Mon, Feb 5, 2024 at 12:18 PM John Haiducek via users <
users@lists.open-mpi.org> wrote:

> I'm having problems running programs compiled against the OpenMPI 5.0.1
> package provided by homebrew on MacOS (arm) 12.6.1.
>
> When running a Fortran test program that simply calls MPI_init followed by
> MPI_Finalize, I get the following output:
>
> $ mpirun -n 2 ./mpi_init_test
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   PML add procs failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_mpi_init: ompi_mpi_instance_init failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --
> [haiducek-lt:0] *** An error occurred in MPI_Init
> [haiducek-lt:0] *** reported by process [1905590273,1]
> [haiducek-lt:0] *** on a NULL communicator
> [haiducek-lt:0] *** Unknown error
> [haiducek-lt:0] *** MPI_ERRORS_ARE_FATAL (processes in this
> communicator will now abort,
> [haiducek-lt:0] ***and MPI will try to terminate your MPI job as
> well)
> --
> prterun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>
>Process name: [prterun-haiducek-lt-15584@1,1] Exit code:14
> --
>
> I'm not sure whether this is the result of a bug in OpenMPI, in the
> homebrew package, or a misconfiguration of my system. Any suggestions for
> troubleshooting this?
>


Re: [OMPI users] [EXT] Re: Error handling

2023-07-19 Thread George Bosilca via users
I think the root cause was that he expected the negative integer resulting
from the reduction to be the exit code of the application, and, as I
explained in my prior email, that's not how exit() works.

The exit() issue aside, MPI_Abort seems to be the right function for this
usage.
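
A minimal sketch of that combination (my own illustration, not code from
this thread): agree on the worst status with an allreduce, then abort with a
positive code that fits in 8 bits so mpirun can report it, as in Jeff's
example below.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* pretend rank 0 hit an error described by a negative status */
    int local_status = (rank == 0) ? -42 : 0;

    int global_status;
    MPI_Allreduce(&local_status, &global_status, 1, MPI_INT, MPI_MIN,
                  MPI_COMM_WORLD);

    if (global_status < 0) {
        /* map to a positive code in 1..255 so mpirun can report it faithfully */
        int exit_code = (-global_status) & 0xff;
        if (exit_code == 0) exit_code = 1;
        MPI_Abort(MPI_COMM_WORLD, exit_code);
    }

    MPI_Finalize();
    return 0;
}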

  George.


On Wed, Jul 19, 2023 at 11:08 AM Jeff Squyres (jsquyres) 
wrote:

> MPI_Allreduce should work just fine, even with negative numbers.  If you
> are seeing something different, can you provide a small reproducer program
> that shows the problem?  We can dig deeper into if if we can reproduce the
> problem.
>
> mpirun's exit status can't distinguish between MPI processes who call
> MPI_Finalize and then return a non-zero exit status and those who invoked
> MPI_Abort.  But if you have 1 process that invokes MPI_Abort with an exit
> status <255, it should be reflected in mpirun's exit status.  For example:
>
> $ cat abort.c
>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char *argv[])
> {
>     int i, rank, size;
>
>     MPI_Init(NULL, NULL);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     if (rank == size - 1) {
>         int err_code = 79;
>         fprintf(stderr, "I am rank %d and am aborting with error code %d\n",
>                 rank, err_code);
>         MPI_Abort(MPI_COMM_WORLD, err_code);
>     }
>
>     fprintf(stderr, "I am rank %d and am exiting with 0\n", rank);
>     MPI_Finalize();
>     return 0;
> }
>
>
> $ mpicc abort.c -o abort
>
>
> $ mpirun --host mpi004:2,mpi005:2 -np 4 ./abort
>
> I am rank 0 and am exiting with 0
>
> I am rank 1 and am exiting with 0
>
> I am rank 2 and am exiting with 0
>
> I am rank 3 and am aborting with error code 79
>
> --
>
> MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
>
> with errorcode 79.
>
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>
> You may or may not see output from other processes, depending on
>
> exactly when Open MPI kills them.
>
> --
>
>
> $ echo $?
>
> 79
>
> --
> *From:* users  on behalf of Alexander
> Stadik via users 
> *Sent:* Wednesday, July 19, 2023 12:45 AM
> *To:* George Bosilca ; Open MPI Users <
> users@lists.open-mpi.org>
> *Cc:* Alexander Stadik 
> *Subject:* Re: [OMPI users] [EXT] Re: Error handling
>
> Hey George,
>
> I said random only because I do not see the method behind it, but exactly
> like this when I do allreduce by MIN and return a negative number I get
> either 248, 253, 11 or 6 usually. Meaning that's purely a number from MPI
> side.
>
> The Problem with MPI_Abort is it shows the correct number in its output in
> Logfile, but it does not communicate its value to other processes, or
> forward its value to exit. So one also always sees these "random" values.
>
> When using positive numbers in range it seems to work, so my question was
> on how it works, and how one can do it? Is there a way to let MPI_Abort
> communicate  the value as exit code?
> Why do negative numbers not work, or does one simply have to always use
> positive numbers? Why I would prefer Abort is because it seems safer.
>
> BR Alex
>
>
> --
> *Von:* George Bosilca 
> *Gesendet:* Dienstag, 18. Juli 2023 18:47
> *An:* Open MPI Users 
> *Cc:* Alexander Stadik 
> *Betreff:* [EXT] Re: [OMPI users] Error handling
>
> External: Check sender address and use caution opening links or
> attachments
>
> Alex,
>
> How are your values "random" if you provide correct values ? Even for
> negative values you could use MIN to pick one value and return it. What is
> the problem with `MPI_Abort` ? it does seem to do what you want.
>
>   George.
>
>
> On Tue, Jul 18, 2023 at 4:38 AM Alexander Stadik via users <
> users@lists.open-mpi.org> wrote:
>
> Hey everyone,
>
> I am working for longer time now with cuda-aware OpenMPI, and developed
> longer time back a small exceptions handling framework including MPI and
> CUDA exceptions.
> Currently I am using MPI_Abort with costum error numbers, to terminate
> everything elegantly, which works well, by just reading the logfile in case
> of a crash.
>
> Now I was wondering how one can handle return / exit codes properly
> between processes, since we would like to filter non-zero exits by return
> code.
>
> One way is a simple 

Re: [OMPI users] [EXT] Re: Error handling

2023-07-19 Thread George Bosilca via users
Alex,

exit(status) does not make status available to the parent process's wait();
instead it makes only the low 8 bits available to the parent, as an unsigned
value. This explains why small positive values seem to work correctly while
negative values do not (because of the 32-bit two's-complement
representation of negative values).
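
A small MPI-free illustration of that truncation (my own sketch, nothing
Open MPI specific):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0)
        exit(-42);                 /* child returns a negative "error code" */

    int status;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status))
        /* prints 214, i.e. (-42) & 0xff, not -42 */
        printf("child exit status: %d\n", WEXITSTATUS(status));
    return 0;
}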

  George.


On Wed, Jul 19, 2023 at 12:45 AM Alexander Stadik <
alexander.sta...@essteyr.com> wrote:

> Hey George,
>
> I said random only because I do not see the method behind it, but exactly
> like this when I do allreduce by MIN and return a negative number I get
> either 248, 253, 11 or 6 usually. Meaning that's purely a number from MPI
> side.
>
> The Problem with MPI_Abort is it shows the correct number in its output in
> Logfile, but it does not communicate its value to other processes, or
> forward its value to exit. So one also always sees these "random" values.
>
> When using positive numbers in range it seems to work, so my question was
> on how it works, and how one can do it? Is there a way to let MPI_Abort
> communicate  the value as exit code?
> Why do negative numbers not work, or does one simply have to always use
> positive numbers? Why I would prefer Abort is because it seems safer.
>
> BR Alex
>
>
> --
> *Von:* George Bosilca 
> *Gesendet:* Dienstag, 18. Juli 2023 18:47
> *An:* Open MPI Users 
> *Cc:* Alexander Stadik 
> *Betreff:* [EXT] Re: [OMPI users] Error handling
>
> External: Check sender address and use caution opening links or
> attachments
>
> Alex,
>
> How are your values "random" if you provide correct values ? Even for
> negative values you could use MIN to pick one value and return it. What is
> the problem with `MPI_Abort` ? it does seem to do what you want.
>
>   George.
>
>
> On Tue, Jul 18, 2023 at 4:38 AM Alexander Stadik via users <
> users@lists.open-mpi.org> wrote:
>
> Hey everyone,
>
> I am working for longer time now with cuda-aware OpenMPI, and developed
> longer time back a small exceptions handling framework including MPI and
> CUDA exceptions.
> Currently I am using MPI_Abort with costum error numbers, to terminate
> everything elegantly, which works well, by just reading the logfile in case
> of a crash.
>
> Now I was wondering how one can handle return / exit codes properly
> between processes, since we would like to filter non-zero exits by return
> code.
>
> One way is a simple Allreduce (in my case) + exit instead of Abort. But
> the problem seems to be the values are always "random" (since I was using
> negative codes), only by using MPI error codes it seems to work correctly.
> But usage of that is limited.
>
> Any suggestions on how to do this / how it can work properly?
>
> BR Alex
>
>
> <https://www.essteyr.com/>
>
> <https://at.linkedin.com/company/ess-engineeringsoftwaresteyr>
> <https://twitter.com/essteyr>  <https://www.facebook.com/essteyr>
> <https://www.instagram.com/ess_engineering_software_steyr/>
>
> DI Alexander Stadik
>
> Head of Large Scale Solutions
> Research & Development | Large Scale Solutions
>
> Book a Meeting
> <https://outlook.office365.com/owa/calendar/di%20alexandersta...@essteyr.com/bookings/>
>
>
> Phone:  +4372522044622
> Company: +43725220446
>
> Mail: alexander.sta...@essteyr.com
>
>
> Register of Firms No.: FN 427703 a
> Commercial Court: District Court Steyr
> UID: ATU69213102
>
> ESS Engineering Software Steyr GmbH • Berggasse 35 • 4400 • Steyr • Austria
>
> This message is confidential. It may also be privileged or otherwise
> protected by work product immunity or other legal rules. If you have
> received it by mistake, please let us know by e-mail reply and delete it
> from your system; you may not copy this message or disclose its contents to
> anyone. Please send us by fax any message containing deadlines as incoming
> e-mails are not screened for response deadlines. The integrity and security
> of this message cannot be guaranteed on the Internet.
>
> <https://www.essteyr.com/event/1-worldwide-coatings-simulation-conference/>
>
>
>


Re: [OMPI users] Error handling

2023-07-18 Thread George Bosilca via users
Alex,

How are your values "random" if you provide correct values? Even for
negative values you could use MIN to pick one value and return it. What is
the problem with `MPI_Abort`? It does seem to do what you want.

  George.


On Tue, Jul 18, 2023 at 4:38 AM Alexander Stadik via users <
users@lists.open-mpi.org> wrote:

> Hey everyone,
>
> I am working for longer time now with cuda-aware OpenMPI, and developed
> longer time back a small exceptions handling framework including MPI and
> CUDA exceptions.
> Currently I am using MPI_Abort with costum error numbers, to terminate
> everything elegantly, which works well, by just reading the logfile in case
> of a crash.
>
> Now I was wondering how one can handle return / exit codes properly
> between processes, since we would like to filter non-zero exits by return
> code.
>
> One way is a simple Allreduce (in my case) + exit instead of Abort. But
> the problem seems to be the values are always "random" (since I was using
> negative codes), only by using MPI error codes it seems to work correctly.
> But usage of that is limited.
>
> Any suggestions on how to do this / how it can work properly?
>
> BR Alex
>
>
> 
>
> 
>   
> 
>
> DI Alexander Stadik
>
> Head of Large Scale Solutions
> Research & Development | Large Scale Solutions
>
> Book a Meeting
> 
>
>
> Phone:  +4372522044622
> Company: +43725220446
>
> Mail: alexander.sta...@essteyr.com
>
>
> Register of Firms No.: FN 427703 a
> Commercial Court: District Court Steyr
> UID: ATU69213102
>
> ESS Engineering Software Steyr GmbH • Berggasse 35 • 4400 • Steyr • Austria
>
> This message is confidential. It may also be privileged or otherwise
> protected by work product immunity or other legal rules. If you have
> received it by mistake, please let us know by e-mail reply and delete it
> from your system; you may not copy this message or disclose its contents to
> anyone. Please send us by fax any message containing deadlines as incoming
> e-mails are not screened for response deadlines. The integrity and security
> of this message cannot be guaranteed on the Internet.
>
> 
>
>


Re: [OMPI users] OMPI compilation error in Making all datatypes

2023-07-12 Thread George Bosilca via users
I can't replicate this on my setting, but I am not using the tar archive
from the OMPI website (I use the git tag). Can you do `ls -l
opal/datatype/.lib` in your build directory.

  George.

On Wed, Jul 12, 2023 at 7:14 AM Elad Cohen via users <
users@lists.open-mpi.org> wrote:

> Hi Jeff, thanks for replying
>
>
> opal/datatype/.libs/libdatatype_reliable.a doesn't exist.
>
>
> I tried building on a networked filesystem , and a local one .
>
>
> when building in /root - i'm getting ore output, but eventually the same
> error:
>
>
> make[2]: Entering directory
> '/root/openmpi-v4.1.5/openmpi-4.1.5/opal/datatype'
>   CC   libdatatype_reliable_la-opal_datatype_pack.lo
>   CC   libdatatype_reliable_la-opal_datatype_unpack.lo
>   CC   opal_convertor_raw.lo
>   CC   opal_convertor.lo
>   CC   opal_copy_functions.lo
>   CC   opal_copy_functions_heterogeneous.lo
>   CC   opal_datatype_add.lo
>   CC   opal_datatype_clone.lo
>   CC   opal_datatype_copy.lo
>   CC   opal_datatype_create.lo
>   CC   opal_datatype_create_contiguous.lo
>   CC   opal_datatype_destroy.lo
>   CC   opal_datatype_dump.lo
>   CC   opal_datatype_fake_stack.lo
>   CC   opal_datatype_get_count.lo
>   CC   opal_datatype_module.lo
>   CC   opal_datatype_monotonic.lo
>   CC   opal_datatype_optimize.lo
>   CC   opal_datatype_pack.lo
>   CC   opal_datatype_position.lo
>   CC   opal_datatype_resize.lo
>   CC   opal_datatype_unpack.lo
>   CCLD libdatatype_reliable.la
> ranlib: '.libs/libdatatype_reliable.a': No such file
> make[2]: *** [Makefile:1870: libdatatype_reliable.la] Error
>
>
>
> --
> *From:* Jeff Squyres (jsquyres) 
> *Sent:* Wednesday, July 12, 2023 1:09:35 PM
> *To:* users@lists.open-mpi.org
> *Cc:* Elad Cohen
> *Subject:* Re: OMPI compilation error in Making all datatypes
>
> The output you sent (in the attached tarball) in doesn't really make much
> sense:
>
> libtool: link: ar cru .libs/libdatatype_reliable.a
> .libs/libdatatype_reliable_la-opal_datatype_pack.o
> .libs/libdatatype_reliable_la-opal_datatype_unpack.o
>
> libtool: link: ranlib .libs/libdatatype_reliable.a
>
> ranlib: '.libs/libdatatype_reliable.a': No such file
>
>
> Specifically:
>
>1. "ar cru .libs/libdatatype_reliable.a" should have created the file
>.libs/libdatatype_reliable.a.
>2. "ranlib .libs/libdatatype_reliable.a" then should modify the
>.libs/libdatatype_reliable.a that was just created.
>
> I'm not sure how #2 fails to find the file that was just created in step
> #1.  No errors were reported by step #1, so that file should be there.
>
> Can you confirm if the file opal/datatype/.libs/libdatatype_reliable.a
> exists?
> Are you building on a networked filesystem, perchance?  If so, is the time
> synchronized between the machine on which you are building and the file
> server?
>
> --
> *From:* users  on behalf of Elad Cohen
> via users 
> *Sent:* Wednesday, July 12, 2023 4:27 AM
> *To:* users@lists.open-mpi.org 
> *Cc:* Elad Cohen 
> *Subject:* [OMPI users] OMPI compilation error in Making all datatypes
>
>
> Hello,
>
> I'm getting this error in both v4.1.4 and v4.1.5:
>
> Making all in datatype
> make[2]: Entering directory
> '/data/bin/openmpi-v4.1.5/openmpi-4.1.5/opal/datatype'
>   CCLD libdatatype_reliable.la
> ranlib: '.libs/libdatatype_reliable.a': No such file
> make[2]: *** [Makefile:1870: libdatatype_reliable.la] Error 1
> make[2]: Leaving directory
> '/data/bin/openmpi-v4.1.5/openmpi-4.1.5/opal/datatype'
> make[1]: *** [Makefile:2394: all-recursive] Error 1
> make[1]: Leaving directory '/data/bin/openmpi-v4.1.5/openmpi-4.1.5/opal'
> make: *** [Makefile:1912: all-recursive] Error 1
>
> Thank you
>
>
>
> Email secured by Check Point - Threat Emulation policy
>
>


Re: [OMPI users] Q: Getting MPI-level memory use from OpenMPI?

2023-04-17 Thread George Bosilca via users
Some folks from ORNL did some studies about OMPI memory usage a few years
ago, but I am not sure whether those studies are openly available. OMPI
manages all the MCA parameters, user-facing requests, unexpected messages,
and temporary buffers for collectives and I/O. Those are, I might be
slightly extrapolating here, linearly dependent on the number of existing
communicators and non-blocking and persistent requests.

As a general statement, low-level communication libraries are not supposed
to use much memory, and the amount should be capped to some extent
(logically, by the number of endpoints or connections). In particular, UCX
has a memory-tracking mechanism similar to OMPI's, via ucs_malloc and
friends. Take a look at ucs/debug/memtrack.c to figure out how to enable it
(enabling statistics, i.e. ENABLE_STATS, may be enough).
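
For the second approach, one generic starting point (assuming valgrind is
installed on the compute nodes; this is not something OMPI provides) is to
run each rank under massif and look at the per-process heap profiles
afterwards:

  mpirun -np 4 valgrind --tool=massif --massif-out-file=massif.out.%p ./your_app
  ms_print massif.out.<pid>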

  George.




On Mon, Apr 17, 2023 at 1:16 PM Brian Dobbins  wrote:

>
> Hi George,
>
>   Got it, thanks for the info - I naively hadn't even considered that of
> course all the related libraries likely have their *own* allocators.  So,
> for *OpenMPI, *it sounds like I can use my own opal_[mc]alloc calls, with
> a new build turning mem debugging on, to tally up and report the total size
> of OpenMPI allocations, and that seems pretty straightforward.  But I'd
> guess that for a data-heavy MPI application, the majority of the memory
> will be in transport-level buffers, and that's (for me) likely the UCX
> layer, so I should look to that community / code for quantifying how large
> those buffers get inside my application?
>
>   Thanks again, and apologies for what is surely a woeful misuse of the
> correct terminology here on some of this stuff.
>
>   - Brian
>
>
> On Mon, Apr 17, 2023 at 11:05 AM George Bosilca 
> wrote:
>
>> Brian,
>>
>> OMPI does not have an official mechanism to report how much memory OMPI
>> allocates. But, there is hope:
>>
>> 1. We have a mechanism to help debug memory issues
>> (OPAL_ENABLE_MEM_DEBUG). You could enable it and then provide your own
>> flavor of memory tracking in opal/util/malloc.c
>> 2. You can use a traditional malloc trapping mechanism (valgrind, malt,
>> mtrace,...), and investigate the stack to detect where the allocation was
>> issued and then count.
>>
>> The first approach would only give you the memory used by OMPI itself,
>> not the other libraries we are using (PMIx, HWLOC, UCX, ...). The second
>> might be a little more generic, but depend on external tools and might take
>> a little time to setup.
>>
>> George.
>>
>>
>> On Fri, Apr 14, 2023 at 3:31 PM Brian Dobbins via users <
>> users@lists.open-mpi.org> wrote:
>>
>>>
>>> Hi all,
>>>
>>>   I'm wondering if there's a simple way to get statistics from OpenMPI
>>> as to how much memory the *MPI* layer in an application is taking.  For
>>> example, I'm running a model and I can get the RSS size at various points
>>> in the code, and that reflects the user data for the application, *plus*,
>>> surely, buffers for MPI messages that are either allocated at runtime or,
>>> maybe, a pool from start-up.  The memory use -which I assume is tied to
>>> internal buffers? differs considerably with *how* I run MPI - eg, TCP
>>> vs UCX, and with UCX, a UD vs RC mode.
>>>
>>>   Here's an example of this:
>>>
>>> 60km (163842 columns), 2304 ranks [OpenMPI]
>>> UCX Transport Changes (environment variable)
>>> (No recompilation; all runs done on same nodes)
>>> Showing memory after ATM-TO-MED Step
>>> [RSS Memory in MB]
>>>
>>> Standard Decomposition
>>> UCX_TLS value ud default rc
>>> Run 1 347.03 392.08 750.32
>>> Run 2 346.96 391.86 748.39
>>> Run 3 346.89 392.18 750.23
>>>
>>>   I'd love a way to trace how much *MPI alone* is using, since here I'm
>>> still measuring the *process's* RSS.  My feeling is that if, for
>>> example, I'm running on N nodes and have a 1GB dataset + (for the sake of
>>> discussion) 100MB of MPI info, then at 2N, with good scaling of domain
>>> memory, that's 500MB + 100MB, at 4N it's 250MB/100MB, and eventually, at
>>> 16N, the MPI memory dominates.  As a result, when we scale out, even with
>>> perfect scaling of *domain* memory, at some point memory associated
>>> with MPI will cause this curve to taper off, and potentially invert.  But
>>> I'm admittedly *way* out of date on how modern MPI implementations
>>> allocate buffers.
>>>
>>>   In short, any tips on ways to better characterize MPI memory use would
>>> be *greatly* appreciated!  If this is purely on the UCX (or other
>>> transport) level, that's good to know too.
>>>
>>>   Thanks,
>>>   - Brian
>>>
>>>
>>>


Re: [OMPI users] Q: Getting MPI-level memory use from OpenMPI?

2023-04-17 Thread George Bosilca via users
Brian,

OMPI does not have an official mechanism to report how much memory OMPI
allocates. But, there is hope:

1. We have a mechanism to help debug memory issues (OPAL_ENABLE_MEM_DEBUG).
You could enable it and then provide your own flavor of memory tracking in
opal/util/malloc.c
2. You can use a traditional malloc trapping mechanism (valgrind, malt,
mtrace,...), and investigate the stack to detect where the allocation was
issued and then count.

The first approach would only give you the memory used by OMPI itself, not
the other libraries we are using (PMIx, HWLOC, UCX, ...). The second might
be a little more generic, but depends on external tools and might take a
little time to set up.

George.


On Fri, Apr 14, 2023 at 3:31 PM Brian Dobbins via users <
users@lists.open-mpi.org> wrote:

>
> Hi all,
>
>   I'm wondering if there's a simple way to get statistics from OpenMPI as
> to how much memory the *MPI* layer in an application is taking.  For
> example, I'm running a model and I can get the RSS size at various points
> in the code, and that reflects the user data for the application, *plus*,
> surely, buffers for MPI messages that are either allocated at runtime or,
> maybe, a pool from start-up.  The memory use -which I assume is tied to
> internal buffers? differs considerably with *how* I run MPI - eg, TCP vs
> UCX, and with UCX, a UD vs RC mode.
>
>   Here's an example of this:
>
> 60km (163842 columns), 2304 ranks [OpenMPI]
> UCX Transport Changes (environment variable)
> (No recompilation; all runs done on same nodes)
> Showing memory after ATM-TO-MED Step
> [RSS Memory in MB]
>
> Standard Decomposition
> UCX_TLS value ud default rc
> Run 1 347.03 392.08 750.32
> Run 2 346.96 391.86 748.39
> Run 3 346.89 392.18 750.23
>
>   I'd love a way to trace how much *MPI alone* is using, since here I'm
> still measuring the *process's* RSS.  My feeling is that if, for example,
> I'm running on N nodes and have a 1GB dataset + (for the sake of
> discussion) 100MB of MPI info, then at 2N, with good scaling of domain
> memory, that's 500MB + 100MB, at 4N it's 250MB/100MB, and eventually, at
> 16N, the MPI memory dominates.  As a result, when we scale out, even with
> perfect scaling of *domain* memory, at some point memory associated with
> MPI will cause this curve to taper off, and potentially invert.  But I'm
> admittedly *way* out of date on how modern MPI implementations allocate
> buffers.
>
>   In short, any tips on ways to better characterize MPI memory use would
> be *greatly* appreciated!  If this is purely on the UCX (or other
> transport) level, that's good to know too.
>
>   Thanks,
>   - Brian
>
>
>


Re: [OMPI users] What is the best choice of pml and btl for intranode communication

2023-03-06 Thread George Bosilca via users
Edgar is right, UCX_TLS has some role in the selection. You can see the
current selection by running `ucx_info -c`. In my case, UCX_TLS is set to
`all` somehow, and I had either a not-connected IB device or a GPU.
However, I did not set UCX_TLS manually and I can't see it anywhere in my
system configuration, so I assume `all` is the default for UCX.

Long story short, the original complaint is correct, without an IB device
(even not connected) the selection logic of the UCX PML will not allow the
run to proceed. In fact, if I manually exclude these devices I cannot get
the selection logic to allow the UCX PML to be selected.

George.

More info:

Here is the verbose output on my system. It seems as if `init` succeeded
because it found an IB device, but it did note that no shared memory
transports were allowed. However, later on, once it listed the transport
used for communications, it only selects shared memory transports.

 ../../../../../../../opal/mca/common/ucx/common_ucx.c:311 self/memory: did
not match transport list
 ../../../../../../../opal/mca/common/ucx/common_ucx.c:311 tcp/lo: did not
match transport list
 ../../../../../../../opal/mca/common/ucx/common_ucx.c:311 tcp/ib1: did not
match transport list
 ../../../../../../../opal/mca/common/ucx/common_ucx.c:311 tcp/eth0: did
not match transport list
 ../../../../../../../opal/mca/common/ucx/common_ucx.c:311 sysv/memory: did
not match transport list
 ../../../../../../../opal/mca/common/ucx/common_ucx.c:311 posix/memory:
did not match transport list
 ../../../../../../../opal/mca/common/ucx/common_ucx.c:306
rc_verbs/mthca0:2: matched transport list but not device list
 ../../../../../../../opal/mca/common/ucx/common_ucx.c:306
ud_verbs/mthca0:2: matched transport list but not device list
 ../../../../../../../opal/mca/common/ucx/common_ucx.c:311 cma/memory: did
not match transport list
 ../../../../../../../opal/mca/common/ucx/common_ucx.c:315 support level is
transports only
 ../../../../../../../ompi/mca/pml/ucx/pml_ucx.c:293 mca_pml_ucx_init
 ../../../../../../../ompi/mca/pml/ucx/pml_ucx.c:358 created ucp context
0x7300c0, worker 0x77f6f010
 ../../../../../../../ompi/mca/pml/ucx/pml_ucx_component.c:147 returning
priority 19
 select: init returned priority 19
 selected ucx best priority 19
 select: component ucx selected

 
+--+-+
   | 0x7300c0 self cfg#0  | tagged message by ucp_tag_send* from host
memory|

 
+--+---+-+
   |  0..7415 | eager short   |
self/memory |
   |7416..inf | (?) rendezvous zero-copy read from remote |
cma/memory  |

 
+--+---+-+

 
++-+
   | 0x7300c0 intra-node cfg#1  | tagged message by ucp_tag_send* from host
memory|

 
++---+-+
   |  0..92 | eager short
| sysv/memory |
   |   93..8248 | eager copy-in copy-out
 | sysv/memory |
   |8249..13737 | multi-frag eager copy-in copy-out
| sysv/memory |
   | 13738..inf | (?) rendezvous zero-copy read from remote
| cma/memory  |

 
++---+-+

The transport used are self, sysv and cma. Now if I specify them by hand
`UCX_TLS=self,cma,sysv` (with or without PROTO enabled) the UCX PML refuses
to be selected.




On Mon, Mar 6, 2023 at 11:41 AM Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> Per George's comments, I stand corrected: UCX *does*​ work fine in
> single-node cases -- he confirmed to me that he tested it on his laptop,
> and it worked for him.
>
> That being said, you're passing "--mca pml ucx" in the correct place now,
> and you're therefore telling Open MPI "_only_ use the UCX PML".  Hence, if
> the UCX PML can't be used, it's an aborting type of error.  The question
> is: *why*​ is the UCX PML not usable on your node?  Your output clearly
> shows that UCX chooses to disable itself -- is that because there are no IB
> / RoCE interfaces at all?  (this is an open question to George / the UCX
> team)
> --
> *From:* Chandran, Arun 
> *Sent:* Monday, March 6, 2023 10:31 AM
> *To:* Jeff Squyres (jsquyres) ; Open MPI Users <
> users@lists.open-mpi.org>
> *Subject:* RE: [OMPI users] What is the best choice of pml and btl for
> intranode communication
>
>
> [Public]
>
>
>
> Hi,
>
>
>
> Yes, it is run on a single node, there is no IB anr RoCE attached to it.
>
>
>
> Pasting the complete o/p (I might have mistakenly copy pasted the command
> in the previous mail)
>
>
>
> #
>
> perf_benchmark $ mpirun -np 2 --map-by core 

Re: [OMPI users] What is the best choice of pml and btl for intranode communication

2023-03-06 Thread George Bosilca via users
The UCX PML should work just fine even in a single-node scenario. As Jeff
indicated, you need to move the MCA param `--mca pml ucx` before your
command.
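
In other words, something along these lines (your options, just reordered):

  mpirun --mca pml ucx -np 32 --map-by core --bind-to core ./perf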

  George.


On Mon, Mar 6, 2023 at 9:48 AM Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> If this run was on a single node, then UCX probably disabled itself since
> it wouldn't be using InfiniBand or RoCE to communicate between peers.
>
> Also, I'm not sure your command line was correct:
>
> perf_benchmark $ mpirun -np 32 --map-by core --bind-to core ./perf  --mca
> pml ucx
>
> You probably need to list all of mpirun's CLI options *before*​ you list
> the ./perf executable.  In its right-to-left traversal, once mpirun hits a
> CLI option it does not recognize (e.g., "./perf"), it assumes that it is
> the user's executable name, and does not process the CLI options to the
> right of that.
>
> Hence, the output you show must have forced the UCX PML another way --
> perhaps you set an environment variable or something?
>
> --
> *From:* users  on behalf of Chandran,
> Arun via users 
> *Sent:* Monday, March 6, 2023 3:33 AM
> *To:* Open MPI Users 
> *Cc:* Chandran, Arun 
> *Subject:* Re: [OMPI users] What is the best choice of pml and btl for
> intranode communication
>
>
> [Public]
>
>
>
> Hi Gilles,
>
>
>
> Thanks very much for the information.
>
>
>
> I was looking for the best pml + btl combination for a standalone intra
> node with high task count (>= 192) with no HPC-class networking installed.
>
>
>
> Just now realized that I can’t use pml ucx for such cases as it is unable
> find IB and fails.
>
>
>
> perf_benchmark $ mpirun -np 32 --map-by core --bind-to core ./perf  --mca
> pml ucx
>
> --
>
>
> No components were able to be opened in the pml
> framework.
>
>
>
>
> This typically means that either no components of this type
> were
>
> installed, or none of the installed components can be
> loaded.
>
> Sometimes this means that shared libraries required by
> these
>
> components are unable to be
> found/loaded.
>
>
>
>
>   Host:
> lib-ssp-04
>
>   Framework:
> pml
>
> --
>
>
> [lib-ssp-04:753542] PML ucx cannot be
> selected
>
> [lib-ssp-04:753531] PML ucx cannot be
> selected
>
> [lib-ssp-04:753541] PML ucx cannot be
> selected
>
> [lib-ssp-04:753539] PML ucx cannot be
> selected
>
> [lib-ssp-04:753545] PML ucx cannot be
> selected
>
> [lib-ssp-04:753547] PML ucx cannot be
> selected
>
> [lib-ssp-04:753572] PML ucx cannot be
> selected
>
> [lib-ssp-04:753538] PML ucx cannot be selected
>
>
> [lib-ssp-04:753530] PML ucx cannot be
> selected
>
> [lib-ssp-04:753537] PML ucx cannot be
> selected
>
> [lib-ssp-04:753546] PML ucx cannot be selected
>
>
> [lib-ssp-04:753544] PML ucx cannot be
> selected
>
> [lib-ssp-04:753570] PML ucx cannot be
> selected
>
> [lib-ssp-04:753567] PML ucx cannot be selected
>
>
> [lib-ssp-04:753534] PML ucx cannot be
> selected
>
> [lib-ssp-04:753592] PML ucx cannot be selected
>
> [lib-ssp-04:753529] PML ucx cannot be selected
>
> 
>
>
>
> That means my only choice is pml/ob1 + btl/vader.
>
>
>
> --Arun
>
>
>
> *From:* users  *On Behalf Of *Gilles
> Gouaillardet via users
> *Sent:* Monday, March 6, 2023 12:56 PM
> *To:* Open MPI Users 
> *Cc:* Gilles Gouaillardet 
> *Subject:* Re: [OMPI users] What is the best choice of pml and btl for
> intranode communication
>
>
>
> *Caution:* This message originated from an External Source. Use proper
> caution when opening attachments, clicking links, or responding.
>
>
>
> Arun,
>
>
>
> First Open MPI selects a pml for **all** the MPI tasks (for example,
> pml/ucx or pml/ob1)
>
>
>
> Then, if pml/ob1 ends up being selected, a btl component (e.g. btl/uct,
> btl/vader) is used for each pair of MPI tasks
>
> (tasks on the same node will use btl/vader, tasks on different nodes will
> use btl/uct)
>
>
>
> Note that if UCX is available, pml/ucx takes the highest priority, so no
> btl is involved
>
> (in your case, if means intra-node communications will be handled by UCX
> and not btl/vader).
>
> You can force ob1 and try different combinations of btl with
>
> mpirun --mca pml ob1 --mca btl self,, ...
>
>
>
> I expect pml/ucx is faster than pml/ob1 with btl/uct for inter node
> communications.
>
>
>
> I have not benchmarked Open MPI for a while and it is possible btl/vader
> outperforms pml/ucx for intra nodes communications,
>
> so if you run on a small number of Infiniband interconnected nodes with a
> large number of tasks per node, you might be able
>
> to get the best performances by forcing pml/ob1.
>
>
>
> Bottom line, I think it is best for you to benchmark your application and
> pick the combination that leads to the best performances,
>
> and you are more than welcome to share your conclusions.
>
>
>
> Cheers,
>
>
>
> Gilles
>
>
>
>
>
> On Mon, Mar 6, 2023 at 3:12 PM Chandran, Arun via users <

Re: [OMPI users] Subcommunicator communications do not complete intermittently

2022-09-11 Thread George Bosilca via users
Assuming a correct implementation, the described communication pattern
should work seamlessly.

Would it be possible to either share a reproducer, or provide the execution
stack by attaching a debugger to the deadlocked application, so we can see
the state of the different processes? I wonder whether all processes
eventually join the gather on comm_world, or whether some of them are stuck
in some orthogonal collective communication pattern.
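
If gdb is available on the nodes, attaching to one of the hanging ranks
(replace <PID> with that rank's process id) is usually enough to capture
the stacks:

  gdb -batch -p <PID> -ex 'thread apply all bt'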

George




On Fri, Sep 9, 2022, 21:24 Niranda Perera via users <
users@lists.open-mpi.org> wrote:

> Hi all,
>
> I have the following use case. I have N mpi ranks in the global
> communicator, and I split it into two, first being rank 0, and the other
> being all ranks from 1-->N-1.
> Rank0 acts as a master and ranks [1, N-1] act as workers. I use rank0 to
> broadcast (blocking) a set of values to ranks [1, N-1] ocer comm_world.
> Rank0 then immediately calls a gather (blocking) over comm_world and
> busywait for results. Once the broadcast is received by workers, they call
> a method foo(args, local_comm). Inside foo, workers communicate with each
> other using the subcommunicator, and each produce N-1 results, which would
> be sent to Rank0 as gather responses over comm_world. Inside foo there are
> multiple iterations, collectives, send-receives, etc.
>
> This seems to be working okay with smaller parallelism and smaller tasks
> of foo. But when the parallelism increases (eg: 64... 512), only a single
> iteration completes inside foo. Subsequent iterations, seems to be hanging.
>
> Is this an anti-pattern in MPI? Should I use igather, ibcast instead of
> blocking calls?
>
> Any help is greatly appreciated.
>
> --
> Niranda Perera
> https://niranda.dev/
> @n1r44 
>
>


Re: [OMPI users] OpenMPI and names of the nodes in a cluster

2022-06-16 Thread George Bosilca via users
This error seems to be initiated from the PMIX regex framework. Not sure
exactly which one is used, but a good starting point is in one of the files
in 3rd-party/openpmix/src/mca/preg/. Look for the generate_node_regex
function in the different components; one of them is raising the error.

George.


On Thu, Jun 16, 2022 at 9:50 AM Patrick Begou via users <
users@lists.open-mpi.org> wrote:

> Hi  Gilles and Jeff,
>
> @Gilles I will have a look at these files, thanks.
>
> @Jeff this is the error message (screen dump attached) and of course the
> nodes names do not agree with the standard.
>
> Patrick
>
>
>
> Le 16/06/2022 à 14:30, Jeff Squyres (jsquyres) a écrit :
>
> What exactly is the error that is occurring?
>
> --
> Jeff squyresjsquy...@cisco.com
>
> 
> From: users  
>  on behalf of Patrick Begou via users 
>  
> Sent: Thursday, June 16, 2022 3:21 AM
> To: Open MPI Users
> Cc: Patrick Begou
> Subject: [OMPI users] OpenMPI and names of the nodes in a cluster
>
> Hi all,
>
> we are facing a serious problem with OpenMPI (4.0.2) that we have
> deployed on a cluster. We do not manage this large cluster and the names
> of the nodes do not agree with Internet standards for protocols: they
> contain a "_" (underscore) character.
>
> So OpenMPI complains about this and do not run.
>
> I've tried to use IP instead of host names in the host file without any
> success.
>
> Is there a known workaround for this as requesting the administrators to
> change the nodes names on this large cluster may be difficult.
>
> Thanks
>
> Patrick
>
>
>
>
>


Re: [OMPI users] Quality and details of implementation for Neighborhood collective operations

2022-06-08 Thread George Bosilca via users
There is a lot of FUD regarding the so-called optimizations for
neighborhood collectives. In general, they all converge toward creating a
globally consistent communication order. If the neighborhood topology is
regular, some parts of that globally consistent communication order can be
inferred, but for arbitrary (irregular) graph topologies the creation
overhead of this globally consistent communication order is significant and
can only be hidden if the collective pattern is reused multiple times (aka
persistent communication). So, while there are some opportunities for
optimization in specific cases, the implementations we provide in OMPI
(both basic and libnbc), despite their apparent simplicity, should perform
reasonably well in most cases.
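
To make the API side concrete, here is a minimal sketch (my own example,
independent of any particular optimization) of a ring neighborhood built
with the graph topology interface and a blocking neighbor exchange. The
persistent variants added in MPI 4.0 (e.g. MPI_Neighbor_alltoall_init) are
where reuse can amortize the setup cost mentioned above.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* run with at least 3 ranks */

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;
    int sources[2]      = { left, right };
    int destinations[2] = { left, right };

    MPI_Comm ring;
    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   2, sources, MPI_UNWEIGHTED,
                                   2, destinations, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0 /* no reorder */, &ring);

    int sendbuf[2] = { rank, rank };
    int recvbuf[2] = { -1, -1 };
    /* one int to each neighbor, one int from each neighbor */
    MPI_Neighbor_alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, ring);

    printf("rank %d received %d (left) and %d (right)\n",
           rank, recvbuf[0], recvbuf[1]);

    MPI_Comm_free(&ring);
    MPI_Finalize();
    return 0;
}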

George.


On Wed, Jun 8, 2022 at 3:58 PM Michael Thomadakis 
wrote:

> I see, thanks
>
> Is there any plan to apply any optimizations on the Neighbor collectives
> at some point?
>
> regards
> Michael
>
> On Wed, Jun 8, 2022 at 1:29 PM George Bosilca  wrote:
>
>> Michael,
>>
>> As far as I know none of the implementations of the
>> neighborhood collectives in OMPI are architecture-aware. The only 2
>> components that provide support for neighborhood collectives are basic (for
>> the blocking version) and libnbc (for the non-blocking versions).
>>
>>   George.
>>
>>
>> On Wed, Jun 8, 2022 at 1:27 PM Michael Thomadakis via users <
>> users@lists.open-mpi.org> wrote:
>>
>>> Hello OpenMPI
>>>
>>> I was wondering if the MPI_Neighbor_x calls have received any
>>> special design and optimizations in OpenMPI 4.1.x+ for these patterns of
>>> communication.
>>>
>>> For instance, these could benefit from proximity awareness and intra- vs
>>> inter-node communications. However, even single node communications have
>>> hierarchical structure due to the increased number of num-domains, larger
>>> L3 caches and so on.
>>>
>>> Is OpenMPI 4.1.x+ leveraging any special logic to optimize these calls?
>>> Is UCX or UCC/HCOLL doing anything special or is OpenMPI using these lower
>>> layers in a more "intelligent" way to provide
>>> optimized neighborhood collectives?
>>>
>>> Thanks you much
>>> Michael
>>>
>>


Re: [OMPI users] Quality and details of implementation for Neighborhood collective operations

2022-06-08 Thread George Bosilca via users
Michael,

As far as I know none of the implementations of the
neighborhood collectives in OMPI are architecture-aware. The only 2
components that provide support for neighborhood collectives are basic (for
the blocking version) and libnbc (for the non-blocking versions).

  George.


On Wed, Jun 8, 2022 at 1:27 PM Michael Thomadakis via users <
users@lists.open-mpi.org> wrote:

> Hello OpenMPI
>
> I was wondering if the MPI_Neighbor_x calls have received any special
> design and optimizations in OpenMPI 4.1.x+ for these patterns of
> communication.
>
> For instance, these could benefit from proximity awareness and intra- vs
> inter-node communications. However, even single node communications have
> hierarchical structure due to the increased number of num-domains, larger
> L3 caches and so on.
>
> Is OpenMPI 4.1.x+ leveraging any special logic to optimize these calls?
> Is UCX or UCC/HCOLL doing anything special or is OpenMPI using these lower
> layers in a more "intelligent" way to provide
> optimized neighborhood collectives?
>
> Thanks you much
> Michael
>


Re: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

2022-05-05 Thread George Bosilca via users
That is weird, but maybe it is not a deadlock, just very slow progress. In
the child, can you print fdmax and i in the do_child frame?

George.

On Thu, May 5, 2022 at 11:50 AM Scott Sayres via users <
users@lists.open-mpi.org> wrote:

> Jeff, thanks.
> from 1:
>
> (lldb) process attach --pid 95083
>
> Process 95083 stopped
>
> * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>
> frame #0: 0x0001bde25628 libsystem_kernel.dylib`close + 8
>
> libsystem_kernel.dylib`close:
>
> ->  0x1bde25628 <+8>:  b.lo   0x1bde25648   ; <+40>
>
> 0x1bde2562c <+12>: pacibsp
>
> 0x1bde25630 <+16>: stpx29, x30, [sp, #-0x10]!
>
> 0x1bde25634 <+20>: movx29, sp
>
> Target 0: (orterun) stopped.
>
> Executable module set to "/usr/local/bin/orterun".
>
> Architecture set to: arm64e-apple-macosx-.
>
> (lldb) thread backtrace
>
> * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>
>   * frame #0: 0x0001bde25628 libsystem_kernel.dylib`close + 8
>
> frame #1: 0x000101563074
> mca_odls_default.so`do_child(cd=0x61e28000, write_fd=40) at
> odls_default_module.c:410:17
>
> frame #2: 0x000101562d7c
> mca_odls_default.so`odls_default_fork_local_proc(cdptr=0x61e28000)
> at odls_default_module.c:646:9
>
> frame #3: 0x000100e2c6f8
> libopen-rte.40.dylib`orte_odls_base_spawn_proc(fd=-1, sd=4,
> cbdata=0x61e28000) at odls_base_default_fns.c:1046:31
>
> frame #4: 0x0001011827a0
> libopen-pal.40.dylib`opal_libevent2022_event_base_loop [inlined]
> event_process_active_single_queue(base=0x00010df069d0) at event.c:1370
> :4 [opt]
>
> frame #5: 0x000101182628
> libopen-pal.40.dylib`opal_libevent2022_event_base_loop [inlined]
> event_process_active(base=0x00010df069d0) at event.c:1440:8 [opt]
>
> frame #6: 0x0001011825ec
> libopen-pal.40.dylib`opal_libevent2022_event_base_loop(base=0x00010df069d0,
> flags=) at event.c:1644:12 [opt]
>
> frame #7: 0x000100bbfb04 orterun`orterun(argc=4,
> argv=0x00016f2432f8) at orterun.c:179:9
>
> frame #8: 0x000100bbf904 orterun`main(argc=4,
> argv=0x00016f2432f8) at main.c:13:12
>
> frame #9: 0x000100f19088 dyld`start + 516
>
> from 2:
>
> scottsayres@scotts-mbp ~ % lldb -p 95082
>
> (lldb) process attach --pid 95082
>
> Process 95082 stopped
>
> * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>
> frame #0: 0x0001bde25654 libsystem_kernel.dylib`read + 8
>
> libsystem_kernel.dylib`read:
>
> ->  0x1bde25654 <+8>:  b.lo   0x1bde25674   ; <+40>
>
> 0x1bde25658 <+12>: pacibsp
>
> 0x1bde2565c <+16>: stpx29, x30, [sp, #-0x10]!
>
> 0x1bde25660 <+20>: movx29, sp
>
> Target 0: (orterun) stopped.
>
> Executable module set to "/usr/local/bin/orterun".
>
> Architecture set to: arm64e-apple-macosx-.
>
> (lldb) thread backtrace
>
> * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>
>   * frame #0: 0x0001bde25654 libsystem_kernel.dylib`read + 8
>
> frame #1: 0x00010116969c libopen-pal.40.dylib`opal_fd_read(fd=22,
> len=20, buffer=0x00016f24299c) at fd.c:51:14
>
> frame #2: 0x000101563388
> mca_odls_default.so`do_parent(cd=0x61e28200, read_fd=22) at
> odls_default_module.c:495:14
>
> frame #3: 0x000101562d90
> mca_odls_default.so`odls_default_fork_local_proc(cdptr=0x61e28200)
> at odls_default_module.c:651:12
>
> frame #4: 0x000100e2c6f8
> libopen-rte.40.dylib`orte_odls_base_spawn_proc(fd=-1, sd=4,
> cbdata=0x61e28200) at odls_base_default_fns.c:1046:31
>
> frame #5: 0x0001011827a0
> libopen-pal.40.dylib`opal_libevent2022_event_base_loop [inlined]
> event_process_active_single_queue(base=0x00010df069d0) at event.c:1370
> :4 [opt]
>
> frame #6: 0x000101182628
> libopen-pal.40.dylib`opal_libevent2022_event_base_loop [inlined]
> event_process_active(base=0x00010df069d0) at event.c:1440:8 [opt]
>
> frame #7: 0x0001011825ec
> libopen-pal.40.dylib`opal_libevent2022_event_base_loop(base=0x00010df069d0,
> flags=) at event.c:1644:12 [opt]
>
> frame #8: 0x000100bbfb04 orterun`orterun(argc=4,
> argv=0x00016f2432f8) at orterun.c:179:9
>
> frame #9: 0x000100bbf904 orterun`main(argc=4,
> argv=0x00016f2432f8) at main.c:13:12
>
> frame #10: 0x000100f19088 dyld`start + 516
>
>


Re: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

2022-05-04 Thread George Bosilca via users
Scott,

This shows the deadlock occurs during the local spawn. Here is how things
are supposed to work: the mpirun process (the parent) forks a child, and the
two processes are connected through a pipe. The child then execve's the
desired command (hostname in your case), and this closes the child's end of
the pipe. The parent detects this event and translates it into a successful
launch. If the execve fails, the child instead writes an error message into
the pipe, so the parent knows something went wrong. Thus, in both cases the
pipe with the child is expected to break soon after the fork, allowing the
parent to know how to handle each case.
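
For the curious, the mechanism is essentially the classic close-on-exec pipe
trick around fork/exec. A minimal standalone sketch of the idea (this is not
the actual Open MPI odls code, just an illustration of the pattern):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void) {
        int pipefd[2], err = 0;
        if (pipe(pipefd) < 0) return 1;
        /* the write end closes automatically on a successful execve */
        fcntl(pipefd[1], F_SETFD, FD_CLOEXEC);

        pid_t pid = fork();
        if (0 == pid) {                       /* child */
            close(pipefd[0]);
            execlp("hostname", "hostname", (char *)NULL);
            err = errno;                      /* only reached if execve failed */
            (void)write(pipefd[1], &err, sizeof(err));
            _exit(127);
        }
        close(pipefd[1]);                     /* parent */
        ssize_t n = read(pipefd[0], &err, sizeof(err));
        if (0 == n)
            printf("launch ok: the pipe broke because execve succeeded\n");
        else
            printf("launch failed in the child: %s\n", strerror(err));
        return 0;
    }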

If you block there, it means the child process was somehow unable to execve
the command, and the parent is stuck waiting for something to happen on the
pipe. So we need to look at the child to see what is going on there.

Unfortunately this is complicated because, unlike gdb, lldb has weak support
for tracking a forked child process. A very recent version of lldb (via brew
or MacPorts) might give you access to `settings set
target.process.follow-fork-mode child` (set it very early in the debugging
session) to force lldb to follow the child instead of the parent. If not, you
might want to install gdb to track the child process (more info here:
https://sourceware.org/gdb/onlinedocs/gdb/Forks.html).
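
With gdb the session would look roughly like this (hypothetical transcript;
the two 'set' commands are the important part):

    $ gdb --args mpirun -np 1 hostname
    (gdb) set follow-fork-mode child
    (gdb) set detach-on-fork off
    (gdb) run
    ... wait for the hang, then hit Ctrl+C ...
    (gdb) backtrace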

  George.




On Wed, May 4, 2022 at 1:01 PM Scott Sayres  wrote:

> Hi George, Thanks!  You have just taught me a new trick.  Although I do
> not yet understand the output, it is below:
>
> scottsayres@scotts-mbp ~ % lldb mpirun -- -np 1 hostname
>
> (lldb) target create "mpirun"
>
> Current executable set to 'mpirun' (arm64).
>
> (lldb) settings set -- target.run-args  "-np" "1" "hostname"
>
> (lldb) run
>
> Process 14031 launched: '/opt/homebrew/bin/mpirun' (arm64)
>
> 2022-05-04 09:53:11.037030-0700 mpirun[14031:1194363]
> [CL_INVALID_OPERATION] : OpenCL Error : Failed to retrieve device
> information! Invalid enumerated value!
>
>
> 2022-05-04 09:53:11.037133-0700 mpirun[14031:1194363]
> [CL_INVALID_OPERATION] : OpenCL Error : Failed to retrieve device
> information! Invalid enumerated value!
>
>
> 2022-05-04 09:53:11.037142-0700 mpirun[14031:1194363]
> [CL_INVALID_OPERATION] : OpenCL Error : Failed to retrieve device
> information! Invalid enumerated value!
>
>
> Process 14031 stopped
>
> * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>
> frame #0: 0x0001bde25654 libsystem_kernel.dylib`read + 8
>
> libsystem_kernel.dylib`read:
>
> ->  0x1bde25654 <+8>:  b.lo   0x1bde25674   ; <+40>
>
> 0x1bde25658 <+12>: pacibsp
>
> 0x1bde2565c <+16>: stpx29, x30, [sp, #-0x10]!
>
> 0x1bde25660 <+20>: movx29, sp
>
> Target 0: (mpirun) stopped.
>
> (lldb) thread backtrace
>
> * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>
>   * frame #0: 0x0001bde25654 libsystem_kernel.dylib`read + 8
>
> frame #1: 0x000100363620 libopen-pal.40.dylib`opal_fd_read + 52
>
> frame #2: 0x00010784b418 
> mca_odls_default.so`odls_default_fork_local_proc
> + 284
>
> frame #3: 0x0001002c7914 
> libopen-rte.40.dylib`orte_odls_base_spawn_proc
> + 968
>
> frame #4: 0x0001003d96dc 
> libevent_core-2.1.7.dylib`event_process_active_single_queue
> + 960
>
> frame #5: 0x0001003d6584 libevent_core-2.1.7.dylib`event_base_loop
> + 952
>
> frame #6: 0x00013cd8 mpirun`orterun + 216
>
> frame #7: 0x000100019088 dyld`start + 516
>
>
> On Wed, May 4, 2022 at 9:36 AM George Bosilca via users <
> users@lists.open-mpi.org> wrote:
>
>> I compiled a fresh copy of the 4.1.3 branch on my M1 laptop, and I can
>> run both MPI and non-MPI apps without any issues.
>>
>> Try running `lldb mpirun -- -np 1 hostname` and once it deadlocks, do a
>> CTRL+C to get back on the debugger and then `backtrace` to see where it is
>> waiting.
>>
>> George.
>>
>>
>> On Wed, May 4, 2022 at 11:28 AM Scott Sayres via users <
>> users@lists.open-mpi.org> wrote:
>>
>>> Thanks for looking at this Jeff.
>>> No, I cannot use mpirun to launch a non-MPI application.The command
>>> "mpirun -np 2 hostname" also hangs.
>>>
>>> I get the following output if I add the -d  command before (I've
>>> replaced the server with the hashtags) :
>>>
>>> [scotts-mbp.3500.dhcp.###:05469] procdir:
>>> /var/folders/l0/94hsdtwj09xd62d90nfh_3h0gn/T//ompi.scotts-mbp.501/pid.5469/0/0
>>>
>>> [scot

Re: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

2022-05-04 Thread George Bosilca via users
I compiled a fresh copy of the 4.1.3 branch on my M1 laptop, and I can run
both MPI and non-MPI apps without any issues.

Try running `lldb mpirun -- -np 1 hostname` and once it deadlocks, do a
CTRL+C to get back on the debugger and then `backtrace` to see where it is
waiting.

George.


On Wed, May 4, 2022 at 11:28 AM Scott Sayres via users <
users@lists.open-mpi.org> wrote:

> Thanks for looking at this Jeff.
> No, I cannot use mpirun to launch a non-MPI application.The command
> "mpirun -np 2 hostname" also hangs.
>
> I get the following output if I add the -d  command before (I've replaced
> the server with the hashtags) :
>
> [scotts-mbp.3500.dhcp.###:05469] procdir:
> /var/folders/l0/94hsdtwj09xd62d90nfh_3h0gn/T//ompi.scotts-mbp.501/pid.5469/0/0
>
> [scotts-mbp.3500.dhcp.###:05469] jobdir:
> /var/folders/l0/94hsdtwj09xd62d90nfh_3h0gn/T//ompi.scotts-mbp.501/pid.5469/0
>
> [scotts-mbp.3500.dhcp.###:05469] top:
> /var/folders/l0/94hsdtwj09xd62d90nfh_3h0gn/T//ompi.scotts-mbp.501/pid.5469
>
> [scotts-mbp.3500.dhcp.###:05469] top:
> /var/folders/l0/94hsdtwj09xd62d90nfh_3h0gn/T//ompi.scotts-mbp.501
>
> [scotts-mbp.3500.dhcp.###:05469] tmp:
> /var/folders/l0/94hsdtwj09xd62d90nfh_3h0gn/T/
>
> [scotts-mbp.3500.dhcp.###:05469] sess_dir_cleanup: job session dir does
> not exist
>
> [scotts-mbp.3500.dhcp.###:05469] sess_dir_cleanup: top session dir not
> empty - leaving
>
> [scotts-mbp.3500.dhcp.###:05469] procdir:
> /var/folders/l0/94hsdtwj09xd62d90nfh_3h0gn/T//ompi.scotts-mbp.501/pid.5469/0/0
>
> [scotts-mbp.3500.dhcp.###:05469] jobdir:
> /var/folders/l0/94hsdtwj09xd62d90nfh_3h0gn/T//ompi.scotts-mbp.501/pid.5469/0
>
> [scotts-mbp.3500.dhcp.###:05469] top:
> /var/folders/l0/94hsdtwj09xd62d90nfh_3h0gn/T//ompi.scotts-mbp.501/pid.5469
>
> [scotts-mbp.3500.dhcp.###:05469] top:
> /var/folders/l0/94hsdtwj09xd62d90nfh_3h0gn/T//ompi.scotts-mbp.501
>
> [scotts-mbp.3500.dhcp.###:05469] tmp:
> /var/folders/l0/94hsdtwj09xd62d90nfh_3h0gn/T/
>
> [scotts-mbp.3500.dhcp.###:05469] [[48286,0],0] Releasing job data for
> [INVALID]
>
> Can you recommend a way to find where mpirun gets stuck?
> Thanks!
> Scott
>
> On Wed, May 4, 2022 at 6:06 AM Jeff Squyres (jsquyres) 
> wrote:
>
>> Are you able to use mpirun to launch a non-MPI application?  E.g.:
>>
>> mpirun -np 2 hostname
>>
>> And if that works, can you run the simple example MPI apps in the
>> "examples" directory of the MPI source tarball (the "hello world" and
>> "ring" programs)?  E.g.:
>>
>> cd examples
>> make
>> mpirun -np 4 hello_c
>> mpirun -np 4 ring_c
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>>
>> 
>> From: users  on behalf of Scott Sayres
>> via users 
>> Sent: Tuesday, May 3, 2022 1:07 PM
>> To: users@lists.open-mpi.org
>> Cc: Scott Sayres
>> Subject: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3
>>
>> Hello,
>> I am new to openmpi, but would like to use it for ORCA calculations, and
>> plan to run codes on the 10 processors of my macbook pro.  I installed this
>> manually and also through homebrew with similar results.  I am able to
>> compile codes with mpicc and run them as native codes, but everything that
>> I attempt with mpirun, mpiexec just freezes.  I can end the program by
>> typing 'control C' twice, but it continues to run in the background and
>> requires me to 'kill '.
>> even as simple as 'mpirun uname' freezes
>>
>> I have tried one installation by: 'arch -arm64 brew install openmpi '
>> and a second by downloading the source file, './configure
>> --prefix=/usr/local', 'make all', make install
>>
>> the commands: 'which mpicc', 'which 'mpirun', etc are able to find them
>> on the path... it just hangs.
>>
>> Can anyone suggest how to fix the problem of the program hanging?
>> Thanks!
>> Scott
>>
>
>
> --
> Scott G Sayres
> Assistant Professor
> School of Molecular Sciences (formerly Department of Chemistry &
> Biochemistry)
> Biodesign Center for Applied Structural Discovery
> Arizona State University
>


Re: [OMPI users] help with M1 chip macOS openMPI installation

2022-04-22 Thread George Bosilca via users
I think you should work under the assumption of cross-compiling, because the
target architecture for the OMPI build should be x86 and not the local
architecture. It has been a while since I last cross-compiled, but I hear
Gilles does cross-compilation routinely, so he might be able to help.

  George.


On Fri, Apr 22, 2022 at 13:14 Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> Can you send all the information listed under "For compile problems"
> (please compress!):
>
> https://www.open-mpi.org/community/help/
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> 
> From: users  on behalf of Cici Feng via
> users 
> Sent: Friday, April 22, 2022 5:30 AM
> To: Open MPI Users
> Cc: Cici Feng
> Subject: Re: [OMPI users] help with M1 chip macOS openMPI installation
>
> Hi George,
>
> Thanks so much with the tips and I have installed Rosetta in order for my
> computer to run the Intel software. However, the same error appears as I
> tried to make the file for the OMPI and here's how it looks:
>
> ../../../../opal/threads/thread_usage.h(163): warning #266: function
> "opal_atomic_swap_ptr" declared implicitly
>
>   OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)
>
>   ^
>
>
> In file included from ../../../../opal/class/opal_object.h(126),
>
>  from ../../../../opal/dss/dss_types.h(40),
>
>  from ../../../../opal/dss/dss.h(32),
>
>  from pmix3x_server_north.c(27):
>
> ../../../../opal/threads/thread_usage.h(163): warning #120: return value
> type does not match the function type
>
>   OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)
>
>   ^
>
>
> pmix3x_server_north.c(157): warning #266: function "opal_atomic_rmb"
> declared implicitly
>
>   OPAL_ACQUIRE_OBJECT(opalcaddy);
>
>   ^
>
>
>   CCLD mca_pmix_pmix3x.la
>
> Making all in mca/pstat/test
>
>   CCLD mca_pstat_test.la
>
> Making all in mca/rcache/grdma
>
>   CCLD mca_rcache_grdma.la
>
> Making all in mca/reachable/weighted
>
>   CCLD mca_reachable_weighted.la
>
> Making all in mca/shmem/mmap
>
>   CCLD mca_shmem_mmap.la
>
> Making all in mca/shmem/posix
>
>   CCLD mca_shmem_posix.la
>
> Making all in mca/shmem/sysv
>
>   CCLD mca_shmem_sysv.la
>
> Making all in tools/wrappers
>
>   CCLD opal_wrapper
>
> Undefined symbols for architecture x86_64:
>
>   "_opal_atomic_add_fetch_32", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_compare_exchange_strong_32", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_compare_exchange_strong_ptr", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_lock", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_lock_init", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_mb", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_rmb", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_sub_fetch_32", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_swap_32", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_swap_ptr", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_unlock", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_wmb", referenced from:
>
>   import-atom in libopen-pal.dylib
>
> ld: symbol(s) not found for architecture x86_64
>
> make[2]: *** [opal_wrapper] Error 1
>
> make[1]: *** [all-recursive] Error 1
>
> make: *** [all-recursive] Error 1
>
>
> I am not sure if the ld part affects the making process or not. Either
> way, error 1 appears as the "opal_wrapper" which I think has been the error
> I kept encoutering.
>
> Is there any explanation to this specific error?
>
> ps. the configure command I used is as followed, provided by the official
> website of MARE2DEM
>
> sudo  ./configure --prefix=/opt/openmpi CC=icc CXX=icc F77=ifort FC=ifort \
> lt_prog_compiler_wl_FC='-Wl,';
> ma

Re: [OMPI users] help with M1 chip macOS openMPI installation

2022-04-21 Thread George Bosilca via users
1. I am not aware of any outstanding OMPI issues with the M1 chip that
would prevent OMPI from compiling and running efficiently in an M1-based
setup, assuming the compilation chain is working properly.

2. The M1 supports x86 code via Rosetta, an app provided by Apple to ensure a
smooth transition from the Intel-based to the M1-based laptop line. I do
recall running an OMPI compiled on my Intel laptop on my M1 laptop to test
the performance of the Rosetta binary translator. We even had some
discussions about this on the mailing list (or in GitHub issues).

3. Based on your original message, and their webpage, MARE2DEM does not
support any compilation chain other than Intel. As explained above, that
might not by itself be a showstopper, because you can run x86 code on the
M1 chip using Rosetta. However, MARE2DEM relies on MKL, the Intel Math Kernel
Library, and that library will not run on an M1 chip.

  George.


On Thu, Apr 21, 2022 at 7:02 AM Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> A little more color on Gilles' answer: I believe that we had some Open MPI
> community members work on adding M1 support to Open MPI, but Gilles is
> absolutely correct: the underlying compiler has to support the M1, or you
> won't get anywhere.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> 
> From: users  on behalf of Cici Feng via
> users 
> Sent: Thursday, April 21, 2022 6:11 AM
> To: Open MPI Users
> Cc: Cici Feng
> Subject: Re: [OMPI users] help with M1 chip macOS openMPI installation
>
> Gilles,
>
> Thank you so much for the quick response!
> openMPI installed by brew is compiled on gcc and gfortran using the
> original compilers by Apple. Now I haven't figured out how to use this gcc
> openMPI for the inversion software :(
> Given by your answer, I think I'll pause for now with the M1-intel
> compilers-openMPI route and switch to an intel cluster until someone
> figured out the M1 chip problem ~
>
> Thanks again for your help!
> Cici
>
> On Thu, Apr 21, 2022 at 5:59 PM Gilles Gouaillardet via users <
> users@lists.open-mpi.org> wrote:
> Cici,
>
> I do not think the Intel C compiler is able to generate native code for
> the M1 (aarch64).
> The best case scenario is it would generate code for x86_64 and then
> Rosetta would be used to translate it to aarch64 code,
> and this is a very downgraded solution.
>
> So if you really want to stick to the Intel compiler, I strongly encourage
> you to run on Intel/AMD processors.
> Otherwise, use a native compiler for aarch64, and in this case, brew is
> not a bad option.
>
>
> Cheers,
>
> Gilles
>
> On Thu, Apr 21, 2022 at 6:36 PM Cici Feng via users <
> users@lists.open-mpi.org> wrote:
> Hi there,
>
> I am trying to install an electromagnetic inversion software (MARE2DEM) of
> which the intel C compilers and open-MPI are considered as the
> prerequisite. However, since I am completely new to computer science and
> coding, together with some of the technical issues of the computer I am
> building all this on, I have encountered some questions with the whole
> process.
>
> The computer I am working on is a macbook pro with a M1 Max chip. Despite
> how my friends have discouraged me to keep working on my M1 laptop, I still
> want to reach out to the developers since I feel like you guys might have a
> solution.
>
> By downloading the source code of openMPI on the .org website and "sudo
> configure and make all install", I was not able to install the openMPI onto
> my computer. The error provided mentioned something about the chip is not
> supported or somewhat.
>
> I have also tried to install openMPI through homebrew using the command
> "brew install openmpi" and it worked just fine. However, since Homebrew has
> automatically set up the configuration of openMPI (it uses gcc and
> gfortran), I was not able to use my intel compilers to build openMPI which
> causes further problems in the installation of my inversion software.
>
> In conclusion, I think right now the M1 chip is the biggest problem of the
> whole installation process yet I think you guys might have some solution
> for the installation. I would assume that Apple is switching all of its
> chip to M1 which makes the shifts and changes inevitable.
>
> I would really like to hear from you with the solution of installing
> openMPI on a M1-chip macbook and I would like to thank for your time to
> read my prolong email.
>
> Thank you very much.
> Sincerely,
>
> Cici
>
>
>
>
>
>


Re: [OMPI users] Monitoring an openmpi cluster.

2022-04-08 Thread George Bosilca via users
Vladimir,

A while back the best cluster monitoring tool was Ganglia (
http://ganglia.sourceforge.net/), but it has not been maintained for
several years. There are quite a few alternatives out there; I found
nightingale (https://github.com/didi/nightingale) simple to install
and use.

Good luck,
  George.


On Fri, Apr 8, 2022 at 6:09 AM Vladimir Nikishkin via users <
users@lists.open-mpi.org> wrote:

> Hello, everyone.
>
> Sorry if my message is somehow off-topic, but googling returns too many
> results, rather than too few, so I would like to ask for someone's
> personal experience.
>
> So, I have a cluster of a few identically set up machines with a shared
> NFS space.
> I would like to have some visualisation of how this cluster is used.
> E.g., how many machines are up, how many are down, how much memory is
> available on each node, how the cluster performance changes with time
> (e.g. total bogomips, total memory), ping to each machine, et cetera.
>
> Can someone recommend some openmpi-oriented monitoring software for such
> a use case?
>
> --
> Your sincerely,
> Vladimir Nikishkin (MiEr, lockywolf)
> (Laptop)
>


Re: [OMPI users] Regarding process binding on OS X with oversubscription

2022-03-17 Thread George Bosilca via users
Sajid,

`--bind-to-core` should have generated the same warning on OSX. Not sure
why this is happening, but I think the real bug here is the lack of warning
when using the deprecated argument.

Btw, the current master does not even accept 'bind-to-core'; instead, it
complains about an 'unrecognized option'.

  George.


On Thu, Mar 17, 2022 at 4:04 PM Sajid Ali 
wrote:

> Hi George,
>
> Thanks a lot for the confirmation!
>
> When one uses the deprecated `--bind-to-core` option, is OpenMPI silently
> ignoring this on OS X? Would this be indicated with increased verbosity
> when using mpiexec?
>
> Thank You,
> Sajid Ali (he/him) | PhD Candidate
> Applied Physics
> Northwestern University
> s-sajid-ali.github.io
>


Re: [OMPI users] Regarding process binding on OS X with oversubscription

2022-03-17 Thread George Bosilca via users
OMPI cannot support process binding on OSX because, as the message
indicates, there is no OS API for process binding (at least none exposed to
user-land applications).

  George.


On Thu, Mar 17, 2022 at 3:25 PM Sajid Ali via users <
users@lists.open-mpi.org> wrote:

> Hi OpenMPI-developers,
>
> When trying to run a program with process binding and oversubscription (on
> a github actions CI instance) with --bind-to-core, OpenMPI’s mpiexec
> executes the programs with no issues.
>
> Noting that --bind-to core is more portable (MPICH’s mpiexec also accepts
> it) and that --bind-to-core is deprecated, I tried switching to it.
> However, OpenMPI now complains that it cannot perform process binding on OS
> X with the following message:
>
> On OS X, processor and memory binding is not available at all (i.e.,
> the OS does not expose this functionality).
>
> Could someone confirm whether OpenMPI supports process binding on OS X and
> also comment on why --bind-to-core works but --bind-to core doesn’t?
> Thanks in advance!
>
> Thank You,
> Sajid Ali (he/him) | PhD Candidate
> Applied Physics
> Northwestern University
> s-sajid-ali.github.io
>


Re: [OMPI users] MPI_Intercomm_create error

2022-03-16 Thread George Bosilca via users
I see similar issues on platforms with multiple IP addresses when some of
them are not fully connected. In general, specifying which interfaces OMPI
can use (with --mca btl_tcp_if_include x.y.z.t/s) solves the problem.
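
For example (the subnet, or interface name, is of course site-specific; pick
the one that is actually routable between all the nodes):

    mpirun --mca btl_tcp_if_include 192.168.1.0/24 ...
    mpirun --mca btl_tcp_if_include eth0 ...

If the runtime itself also picks the wrong interface, the analogous
oob_tcp_if_include parameter may be needed as well.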

  George.


On Wed, Mar 16, 2022 at 5:11 PM Mccall, Kurt E. (MSFC-EV41) via users <
users@lists.open-mpi.org> wrote:

> I’m using OpenMpi 4.1.2 under Slurm 20.11.8.  My 2 process job is
> successfully launched, but when the main process rank 0
>
> attempts to create an intercommunicator with process rank 1 on the other
> node:
>
>
>
> MPI_Comm intercom;
>
> MPI_Intercomm_create(MPI_COMM_SELF, 0, MPI_COMM_WORLD, 1, ,
>   );
>
>
>
> OpenMpi spins deep inside the MPI_Intercomm_create code, and the following
> error is reported:
>
>
>
> *WARNING: Open MPI accepted a TCP connection from what appears to be a*
>
> *another Open MPI process but cannot find a corresponding process*
>
> *entry for that peer.*
>
>
>
> *This attempted connection will be ignored; your MPI job may or may not*
>
> *continue properly.*
>
>
>
> The output resulting from using the mpirun arguments “--mca
> ras_base_verbose 5 --display-devel-map --mca rmaps_base_verbose 5” is
> attached.
>
> Any help would be appreciated.
>


Re: [OMPI users] Call to MPI_Allreduce() returning value 15

2022-03-09 Thread George Bosilca via users
There are two ways MPI_Allreduce can return MPI_ERR_TRUNCATE:
1. It is propagated from one of the underlying point-to-point
communications, which means that at least one of the participants has an
input buffer with a larger size than the others. I know you said the size is
fixed, but that only holds if all processes are in the same blocking
MPI_Allreduce call.
2. The code is not SPMD, and one of your processes calls a different
MPI_Allreduce on the same communicator.

There is no simple way to get more information about this issue. If you
have a version of OMPI compiled in debug mode, you can increase the
verbosity of the collective framework to see if you get more interesting
information.
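
For what it is worth, here is a minimal, artificial example of case 1 (the
ranks disagree on the count), which also shows how to turn an error code into
a readable string, which answers your question 3. Depending on the algorithm
selected, the mismatch typically surfaces as MPI_ERR_TRUNCATE on the ranks
with the smaller count, but it may also hang; it is erroneous MPI usage
either way:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        double in[4] = {0.0}, out[4] = {0.0};
        int count = (0 == rank) ? 4 : 2;  /* the bug: ranks disagree on the count */

        int rc = MPI_Allreduce(in, out, count, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
        if (MPI_SUCCESS != rc) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            printf("rank %d: MPI_Allreduce returned %d (%s)\n", rank, rc, msg);
        }
        MPI_Finalize();
        return 0;
    }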

George.


On Wed, Mar 9, 2022 at 2:23 PM Ernesto Prudencio via users <
users@lists.open-mpi.org> wrote:

> Hello all,
>
>
>
> The very simple code below returns mpiRC = 15.
>
>
>
> const std::array< double, 2 > rangeMin { minX, minY };
>
> std::array< double, 2 > rangeTempRecv { 0.0, 0.0 };
>
> int mpiRC = MPI_Allreduce( rangeMin.data(), rangeTempRecv.data(),
> rangeMin.size(), MPI_DOUBLE, MPI_MIN, PETSC_COMM_WORLD );
>
>
>
> Some information before my questions:
>
>1. The environment I am running this code has hundreds of compute
>nodes, each node with 4 MPI ranks.
>2. It is running in the cloud, so it is tricky to get extra
>information “on the fly”.
>3. I am using OpenMPI 4.1.2 + PETSc 3.16.5 + GNU compilers.
>4. The error happens consistently at the same point in the execution,
>at ranks 1 and 2 only (out of hundreds of MPI ranks).
>5. By the time the execution gets to the code above, the execution has
>already called PetscInitialize() and many MPI routines successfully
>6. Before the call to MPI_Allreduce() above, the code calls
>MPI_Barrier(). So, all nodes call MPI_Allreduce()
>7. At https://www.open-mpi.org/doc/current/man3/OpenMPI.3.php it is
>written “MPI_ERR_TRUNCATE  15  Message truncated on receive.”
>8. At https://www.open-mpi.org/doc/v4.1/man3/MPI_Allreduce.3.php, it
>is written “The reduction functions ( *MPI_Op* ) do not return an
>error value. As a result, if the functions detect an error, all they can do
>is either call *MPI_Abort
>* or silently
>skip the problem. Thus, if you change the error handler from
>*MPI_ERRORS_ARE_FATAL* to something else, for example,
>*MPI_ERRORS_RETURN* , then no error may be indicated.”
>
>
>
> Questions:
>
>1. Any ideas for what could be the cause for the return code 15? The
>code is pretty simple and the buffers have fixed size = 2.
>2. In view of item (8), does it mean that the return code 15 in item
>(7) might not be informative?
>3. Once I get a return code != MPI_SUCCESS, is there any routine I can
>call, in the application code, to get extra information on MPI?
>4. Once the application aborts (I throw an exception once a return
>code is != MPI_SUCESS), is there some command line I can run on all nodes
>in order to get extra info?
>
>
>
> Thank you in advance,
>
>
>
> Ernesto.
>
> Schlumberger-Private
>


Re: [OMPI users] Where can a graph communicator be used?

2022-02-15 Thread George Bosilca via users
Sorry, I should have been more precise in my answer. Topology information
is only used during neighborhood communications via the specialized API; in
all other cases the communicator behaves as a normal, fully connected
communicator.
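
A small self-contained illustration of that distinction, using a ring graph
just for the example (MPI_Neighbor_allgather uses the attached topology,
while MPI_Allgather on the very same communicator ignores it):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int me, np;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        int src = (me + np - 1) % np;   /* I receive from my left neighbor */
        int dst = (me + 1) % np;        /* I send to my right neighbor */
        MPI_Comm gcomm;
        MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                       1, &src, MPI_UNWEIGHTED,
                                       1, &dst, MPI_UNWEIGHTED,
                                       MPI_INFO_NULL, 0, &gcomm);

        /* topology-aware: exchanges only with the declared neighbors */
        int from_nbr = -1;
        MPI_Neighbor_allgather(&me, 1, MPI_INT, &from_nbr, 1, MPI_INT, gcomm);

        /* topology-oblivious: gcomm behaves like any fully connected comm */
        int *all = malloc(np * sizeof(int));
        MPI_Allgather(&me, 1, MPI_INT, all, 1, MPI_INT, gcomm);

        printf("rank %d: got %d from its neighbor, %d ranks in the allgather\n",
               me, from_nbr, np);
        free(all);
        MPI_Comm_free(&gcomm);
        MPI_Finalize();
        return 0;
    }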

  George.


On Tue, Feb 15, 2022 at 9:28 AM Neil Carlson via users <
users@lists.open-mpi.org> wrote:

>
>
> On Mon, Feb 14, 2022 at 9:01 PM George Bosilca 
> wrote:
>
>> On Mon, Feb 14, 2022 at 6:33 PM Neil Carlson via users <
>> users@lists.open-mpi.org> wrote:
>>
>>> 1. Where can I use this communicator?  Can it be used with  the usual
>>> stuff like MPI_Allgather, or do I need to hang onto the original
>>> communicator (MPI_COMM_WORLD actually) for that purpose?
>>>
>>
>> Anywhere a communicator is used. You just have to be careful and
>> understand what is the scope of the communication you use them with.
>>
>
> Ah! I was thinking that this graph topology information might only be
> relevant to MPI_Neighbor collectives. But would it be proper then to think
> of a communicator having an implicit totally-connected graph topology that
> is replaced by this one? If so would Bcast, for example, only send from the
> root rank to those it was a source for in the graph topology? Or Gather on
> a rank only receive values from those ranks that were a source for it? What
> would the difference be then between Alltoallv, say, and Neighbor_alltoallv?
>


Re: [OMPI users] Where can a graph communicator be used?

2022-02-14 Thread George Bosilca via users
On Mon, Feb 14, 2022 at 6:33 PM Neil Carlson via users <
users@lists.open-mpi.org> wrote:

> I've been successful at using MPI_Dist_graph_create_adjacent to create a
> new communicator with graph topology, and using it with
> MPI_Neighbor_alltoallv.  But I have a few questions:
>
> 1. Where can I use this communicator?  Can it be used with  the usual
> stuff like MPI_Allgather, or do I need to hang onto the original
> communicator (MPI_COMM_WORLD actually) for that purpose?
>

Anywhere a communicator is used. You just have to be careful and understand
the scope of the communications you use it with.

2. It turns out that my graph isn't symmetric sometimes (but I think I
> understood that is okay). I usually just need to send stuff in one
> direction, but occasionally it needs to go in the reverse direction.  Am I
> right that I need a second graph communicator built with the reverse edges
> to use with MPI_Neighbor_alltoallv for that communication?  My testing
> seems to indicate so, but I'm not absolutely certain.
>

If the symmetry is statically defined, then you should use a single
communicator. However, if the symmetry depends on conditions that change
during the execution, then you will need two communicators and will have to
use them according to your needs.


> 3. Is there any real advantage to using the non-symmetric graph, or should
> I just symmetrize it and use the one?
>

Reducing communications would be a good reason to use a non-symmetric
graph. But clearly, if you are only saving a small fraction of your overall
communication volume, this might not be as important as having readable
code.

  George.



>
> Thanks for your help!
>


Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-11 Thread George Bosilca via users
I am not sure I understand the comment about MPI_T.

Each network card has internal counters that can be gathered by any process
on the node. Similarly, some information is available from the switches,
but I always assumed that information is aggregated across all ongoing
jobs. Still, by merging the switch-level information with the MPI-level
information, the relevant trends can be highlighted.

  George.


On Fri, Feb 11, 2022 at 12:43 PM Bertini, Denis Dr. 
wrote:

> May be i am wrong, but the MPI_T seems to aim to internal openMPI
> parameters right?
>
>
> So with which kind of magic a tool like OSU INAM can get info from network
> fabric and even
>
> switches related to a particular MPI job ...
>
>
> There should be more info gathered in the background 
>
>
> ------
> *From:* George Bosilca 
> *Sent:* Friday, February 11, 2022 4:25:42 PM
> *To:* Open MPI Users
> *Cc:* Joseph Schuchart; Bertini, Denis Dr.
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband
> network
>
> Collecting data during execution is possible in OMPI either with an
> external tool, such as mpiP, or the internal infrastructure, SPC. Take a
> look at ./examples/spc_example.c or ./test/spc/spc_test.c to see how to use
> this.
>
>   George.
>
>
> On Fri, Feb 11, 2022 at 9:43 AM Bertini, Denis Dr. via users <
> users@lists.open-mpi.org> wrote:
>
>> I have seen in OSU INAM paper:
>>
>>
>> "
>> While we chose MVAPICH2 for implementing our designs, any MPI
>> runtime (e.g.: OpenMPI [12]) can be modified to perform similar data
>> collection and
>> transmission.
>> "
>>
>> But i do not know what it is meant with "modified" openMPI ?
>>
>>
>> Cheers,
>>
>> Denis
>>
>>
>> --
>> *From:* Joseph Schuchart 
>> *Sent:* Friday, February 11, 2022 3:02:36 PM
>> *To:* Bertini, Denis Dr.; Open MPI Users
>> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband
>> network
>>
>> I am not aware of anything similar in Open MPI. Maybe OSU-INAM can work
>> with other MPI implementations? Would be worth investigating...
>>
>> Joseph
>>
>> On 2/11/22 06:54, Bertini, Denis Dr. wrote:
>> >
>> > Hi Joseph
>> >
>> > Looking at the MVAPICH i noticed that, in this MPI implementation
>> >
>> > a Infiniband Network Analysis  and Profiling Tool  is provided:
>> >
>> >
>> > OSU-INAM
>> >
>> >
>> > Is there something equivalent using openMPI ?
>> >
>> > Best
>> >
>> > Denis
>> >
>> >
>> > 
>> > *From:* users  on behalf of Joseph
>> > Schuchart via users 
>> > *Sent:* Tuesday, February 8, 2022 4:02:53 PM
>> > *To:* users@lists.open-mpi.org
>> > *Cc:* Joseph Schuchart
>> > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
>> > Infiniband network
>> > Hi Denis,
>> >
>> > Sorry if I missed it in your previous messages but could you also try
>> > running a different MPI implementation (MVAPICH) to see whether Open MPI
>> > is at fault or the system is somehow to blame for it?
>> >
>> > Thanks
>> > Joseph
>> >
>> > On 2/8/22 03:06, Bertini, Denis Dr. via users wrote:
>> > >
>> > > Hi
>> > >
>> > > Thanks for all these informations !
>> > >
>> > >
>> > > But i have to confess that in this multi-tuning-parameter space,
>> > >
>> > > i got somehow lost.
>> > >
>> > > Furthermore it is somtimes mixing between user-space and kernel-space.
>> > >
>> > > I have only possibility to act on the user space.
>> > >
>> > >
>> > > 1) So i have on the system max locked memory:
>> > >
>> > > - ulimit -l unlimited (default )
>> > >
>> > >   and i do not see any warnings/errors related to that when
>> > launching MPI.
>> > >
>> > >
>> > > 2) I tried differents algorithms for MPI_all_reduce op.  all showing
>> > > drop in
>> > >
>> > > bw for size=16384
>> > >
>> > >
>> > > 4) I disable openIB ( no RDMA, ) and used only TCP, and i noticed
>> > >
>> > > the same behaviour.
>> > >
>> > >
>

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-11 Thread George Bosilca via users
Collecting data during execution is possible in OMPI either with an
external tool, such as mpiP, or with the internal infrastructure, SPC. Take
a look at ./examples/spc_example.c or ./test/spc/spc_test.c to see how to
use this.
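
If you want to see what is exposed before instrumenting anything, a small
MPI_T query loop such as the one below will list the available performance
variables. This is only a sketch: the variable names themselves are Open
MPI-specific, and the SPC counters additionally have to be enabled at run
time (see the spc_example.c mentioned above for the exact MCA knobs):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, num, i;
        MPI_Init(&argc, &argv);
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

        MPI_T_pvar_get_num(&num);
        for (i = 0; i < num; i++) {
            char name[256], desc[1024];
            int nlen = sizeof(name), dlen = sizeof(desc);
            int verb, vclass, bind, readonly, continuous, atomic;
            MPI_Datatype dt;
            MPI_T_enum et;
            MPI_T_pvar_get_info(i, name, &nlen, &verb, &vclass, &dt, &et,
                                desc, &dlen, &bind, &readonly, &continuous,
                                &atomic);
            printf("pvar %3d: %s -- %s\n", i, name, desc);
        }
        MPI_T_finalize();
        MPI_Finalize();
        return 0;
    }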

  George.


On Fri, Feb 11, 2022 at 9:43 AM Bertini, Denis Dr. via users <
users@lists.open-mpi.org> wrote:

> I have seen in OSU INAM paper:
>
>
> "
> While we chose MVAPICH2 for implementing our designs, any MPI
> runtime (e.g.: OpenMPI [12]) can be modified to perform similar data
> collection and
> transmission.
> "
>
> But i do not know what it is meant with "modified" openMPI ?
>
>
> Cheers,
>
> Denis
>
>
> --
> *From:* Joseph Schuchart 
> *Sent:* Friday, February 11, 2022 3:02:36 PM
> *To:* Bertini, Denis Dr.; Open MPI Users
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband
> network
>
> I am not aware of anything similar in Open MPI. Maybe OSU-INAM can work
> with other MPI implementations? Would be worth investigating...
>
> Joseph
>
> On 2/11/22 06:54, Bertini, Denis Dr. wrote:
> >
> > Hi Joseph
> >
> > Looking at the MVAPICH i noticed that, in this MPI implementation
> >
> > a Infiniband Network Analysis  and Profiling Tool  is provided:
> >
> >
> > OSU-INAM
> >
> >
> > Is there something equivalent using openMPI ?
> >
> > Best
> >
> > Denis
> >
> >
> > 
> > *From:* users  on behalf of Joseph
> > Schuchart via users 
> > *Sent:* Tuesday, February 8, 2022 4:02:53 PM
> > *To:* users@lists.open-mpi.org
> > *Cc:* Joseph Schuchart
> > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
> > Infiniband network
> > Hi Denis,
> >
> > Sorry if I missed it in your previous messages but could you also try
> > running a different MPI implementation (MVAPICH) to see whether Open MPI
> > is at fault or the system is somehow to blame for it?
> >
> > Thanks
> > Joseph
> >
> > On 2/8/22 03:06, Bertini, Denis Dr. via users wrote:
> > >
> > > Hi
> > >
> > > Thanks for all these informations !
> > >
> > >
> > > But i have to confess that in this multi-tuning-parameter space,
> > >
> > > i got somehow lost.
> > >
> > > Furthermore it is somtimes mixing between user-space and kernel-space.
> > >
> > > I have only possibility to act on the user space.
> > >
> > >
> > > 1) So i have on the system max locked memory:
> > >
> > > - ulimit -l unlimited (default )
> > >
> > >   and i do not see any warnings/errors related to that when
> > launching MPI.
> > >
> > >
> > > 2) I tried differents algorithms for MPI_all_reduce op.  all showing
> > > drop in
> > >
> > > bw for size=16384
> > >
> > >
> > > 4) I disable openIB ( no RDMA, ) and used only TCP, and i noticed
> > >
> > > the same behaviour.
> > >
> > >
> > > 3) i realized that increasing the so-called warm up parameter  in the
> > >
> > > OSU benchmark (argument -x 200 as default) the discrepancy.
> > >
> > > At the contrary putting lower threshold ( -x 10 ) can increase this BW
> > >
> > > discrepancy up to factor 300 at message size 16384 compare to
> > >
> > > message size 8192 for example.
> > >
> > > So does it means that there are some caching effects
> > >
> > > in the internode communication?
> > >
> > >
> > > From my experience, to tune parameters is a time-consuming and
> > cumbersome
> > >
> > > task.
> > >
> > >
> > > Could it also be the problem is not really on the openMPI
> > > implemenation but on the
> > >
> > > system?
> > >
> > >
> > > Best
> > >
> > > Denis
> > >
> > >
> 
> > > *From:* users  on behalf of Gus
> > > Correa via users 
> > > *Sent:* Monday, February 7, 2022 9:14:19 PM
> > > *To:* Open MPI Users
> > > *Cc:* Gus Correa
> > > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
> > > Infiniband network
> > > This may have changed since, but these used to be relevant points.
> > > Overall, the Open MPI FAQ have lots of good suggestions:
> > > https://www.open-mpi.org/faq/
> > > some specific for performance tuning:
> > > https://www.open-mpi.org/faq/?category=tuning
> > > https://www.open-mpi.org/faq/?category=openfabrics
> > >
> > > 1) Make sure you are not using the Ethernet TCP/IP, which is widely
> > > available in compute nodes:
> > > mpirun  --mca btl self,sm,openib  ...
> > >
> > > https://www.open-mpi.org/faq/?category=tuning#selecting-components
> > >
> > > However, this may have changed lately:
> > > https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable
> > > 2) Maximum locked memory used by IB and their system limit. Start
> > > here:
> > >
> >
> https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage
> > > 3) The eager vs. rendezvous message size threshold. I wonder if it may
> > > sit right where you see the latency spike.
> > > https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user
> > > 4) Processor and memory locality/affinity 

Re: [OMPI users] unexpected behavior when combining MPI_Gather and MPI_Type_vector

2021-12-16 Thread George Bosilca via users
Jonas,

Section 5.1.6 of MPI 4.0 should give you a better idea about the
differences between size, extent and true extent. There are also a few
examples in Section 5.1.14 on how to manipulate a datatype using its extent.
I think you will find Examples 5.13 to 5.16 of particular interest.

Best,
  George.


On Thu, Dec 16, 2021 at 5:15 PM Jonas Thies  wrote:

> Hi George,
>
> thanks, I'm starting to understand this now.
>
> Still not quite intuitive that "Type_create_resized" allows me to reset
> the extent but not the size (just from a naming perspective).
>
> The man page is talking about extent, upper and lower bounds, but the
> upper bound cannot be specified:
>
> NAME
>MPI_Type_create_resized  -  Returns a new data type with new extent
> and
>upper and lower bounds.
>
> SYNTAX
> C Syntax
>#include 
>int MPI_Type_create_resized(MPI_Datatype oldtype, MPI_Aint lb,
> MPI_Aint extent, MPI_Datatype *newtype)
>
>
> Jonas
> On 16-12-2021 22:39, George Bosilca wrote:
>
> You are confusing the size and extent of the datatype. The size (aka the
> physical number of bytes described by the memory layout) would be
> m*nloc*sizeof(type), while the extent will be related to where you expect
> the second element of the same type to start. If you do resize, you will
> incorporate the leading dimension in your pointer computation, and will see
> the gaps you were reporting.
>
>   George.
>
>
>
>
> On Thu, Dec 16, 2021 at 3:03 PM Jonas Thies via users <
> users@lists.open-mpi.org> wrote:
>
>> Dear Gilles,
>>
>> thanks, the resizing fixes the issue, it seems. It is not really
>> intuitive, though, because the actual extent of the data type is
>> m*nloc*sizeof(int) and I have to make MPI believe that it is
>> nloc*sizeof(int). And indeed, this seems to be not OpenMPI-specific, sorry
>> for that.
>>
>> Best,
>>
>> Jonas
>>
>>   MPI_Type_vector (Gilles Gouaillardet)
>>
>>
>> --
>>
>> Message: 1
>> Date: Thu, 16 Dec 2021 10:29:27 +0100
>> From: Jonas Thies  
>> To: users@lists.open-mpi.org
>> Subject: [OMPI users] unexpected behavior when combining MPI_Gather
>>  and MPI_Type_vector
>> Message-ID: <64075574-7a58-b194-208f-d455c10c8...@tudelft.nl> 
>> <64075574-7a58-b194-208f-d455c10c8...@tudelft.nl>
>> Content-Type: text/plain; charset="utf-8"; Format="flowed"
>>
>> Dear OpenMPI community,
>>
>> Here's a little puzzle for the Christmas holidays (although I would
>> really appreciate a quick solution!).
>>
>> I'm stuck with the following relatively basic problem: given a local
>> nloc x m matrix X_p in column-major ordering on each MPI process p,
>> perform a single MPI_Gather operation to construct the matrix
>> X_0
>> X_1
>> ...
>>
>> X_nproc
>>
>> again, in col-major ordering. My approach is to use MPI_Type_vector to
>> define an stype and an rtype, where stype has stride nloc, and rtype has
>> stride nproc*nloc. The observation is that there is an unexpected
>> displacement of (m-1)*n*p in the result array for the part arriving from
>> process p.
>>
>> The MFE code is attached, and I use OpenMPI 4.0.5 with GCC 11.2
>> (although other versions and even distributions seem to display the same
>> behavior). Example (nloc=3, nproc=3, m=2, with some additional columns
>> printed for the sake of demonstration):
>>
>>
>>  > mpicxx -o matrix_gather matrix_gather.cpp
>> mpirun -np 3 ./matrix_gather
>>
>> v_loc on P0: 3x2
>> 0 9
>> 1 10
>> 2 11
>>
>> v_loc on P1: 3x2
>> 3 12
>> 4 13
>> 5 14
>>
>> v_loc on P2: 3x2
>> 6 15
>> 7 16
>> 8 17
>>
>> v_glob on P0: 9x4
>> 0 9 0 0
>> 1 10 0 0
>> 2 11 0 0
>> 0 3 12 0
>> 0 4 13 0
>> 0 5 14 0
>> 0 0 6 15
>> 0 0 7 16
>> 0 0 8 17
>>
>> Any ideas?
>>
>> Thanks,
>>
>> Jonas
>>
>>
>>
>> --
>> *J. Thies*
>> Assistant Professor
>>
>> TU Delft
>> Faculty Electrical Engineering, Mathematics and Computer Science
>> Institute of Applied Mathematics and High Performance Computing Center
>> Mekelweg 4
>> 2628 CD Delft
>>
>> T +31 15 27 
>> *j.th...@tudelft.nl *
>>
> --
> *J. Thies*
> Assistant Professor
>
> TU Delft
> Faculty Electrical Engineering, Mathematics and Computer Science
> Institute of Applied Mathematics and High Performance Computing Center
> Mekelweg 4
> 2628 CD Delft
>
> T +31 15 27 
> *j.th...@tudelft.nl *
>


Re: [OMPI users] unexpected behavior when combining MPI_Gather and MPI_Type_vector

2021-12-16 Thread George Bosilca via users
You are confusing the size and the extent of the datatype. The size (i.e.,
the physical number of bytes described by the memory layout) would be
m*nloc*sizeof(type), while the extent determines where MPI expects the next
element of the same type to start. That extent is what enters the pointer
computation on the receive side, so it is where the leading dimension gets
incorporated, and it is exactly what produced the gaps you were reporting;
resizing the extent is how you control it.
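
Concretely, for the gather in your example the usual idiom is to describe
the strided block on the receive side and then shrink its extent to nloc
ints, so that the block from rank p lands nloc rows further down instead of
one full extent further down. A sketch, reusing the names from your code
(v_loc, v_glob, nloc, m, nproc) and assuming int data as in the test:

    /* receive side: one rank's nloc x m block inside the (nproc*nloc) x m
     * column-major result, i.e. m columns of nloc ints with stride nproc*nloc */
    MPI_Datatype rtype, rtype_resized;
    MPI_Type_vector(m, nloc, nproc * nloc, MPI_INT, &rtype);
    /* shrink the extent so consecutive blocks start nloc ints apart */
    MPI_Type_create_resized(rtype, 0, nloc * (MPI_Aint)sizeof(int), &rtype_resized);
    MPI_Type_commit(&rtype_resized);

    /* send side: the local block is contiguous, so a plain count is enough */
    MPI_Gather(v_loc, nloc * m, MPI_INT,
               v_glob, 1, rtype_resized, 0, MPI_COMM_WORLD);

    MPI_Type_free(&rtype);
    MPI_Type_free(&rtype_resized);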

  George.




On Thu, Dec 16, 2021 at 3:03 PM Jonas Thies via users <
users@lists.open-mpi.org> wrote:

> Dear Gilles,
>
> thanks, the resizing fixes the issue, it seems. It is not really
> intuitive, though, because the actual extent of the data type is
> m*nloc*sizeof(int) and I have to make MPI believe that it is
> nloc*sizeof(int). And indeed, this seems to be not OpenMPI-specific, sorry
> for that.
>
> Best,
>
> Jonas
>
>   MPI_Type_vector (Gilles Gouaillardet)
>
>
> --
>
> Message: 1
> Date: Thu, 16 Dec 2021 10:29:27 +0100
> From: Jonas Thies  
> To: users@lists.open-mpi.org
> Subject: [OMPI users] unexpected behavior when combining MPI_Gather
>   and MPI_Type_vector
> Message-ID: <64075574-7a58-b194-208f-d455c10c8...@tudelft.nl> 
> <64075574-7a58-b194-208f-d455c10c8...@tudelft.nl>
> Content-Type: text/plain; charset="utf-8"; Format="flowed"
>
> Dear OpenMPI community,
>
> Here's a little puzzle for the Christmas holidays (although I would
> really appreciate a quick solution!).
>
> I'm stuck with the following relatively basic problem: given a local
> nloc x m matrix X_p in column-major ordering on each MPI process p,
> perform a single MPI_Gather operation to construct the matrix
> X_0
> X_1
> ...
>
> X_nproc
>
> again, in col-major ordering. My approach is to use MPI_Type_vector to
> define an stype and an rtype, where stype has stride nloc, and rtype has
> stride nproc*nloc. The observation is that there is an unexpected
> displacement of (m-1)*n*p in the result array for the part arriving from
> process p.
>
> The MFE code is attached, and I use OpenMPI 4.0.5 with GCC 11.2
> (although other versions and even distributions seem to display the same
> behavior). Example (nloc=3, nproc=3, m=2, with some additional columns
> printed for the sake of demonstration):
>
>
>  > mpicxx -o matrix_gather matrix_gather.cpp
> mpirun -np 3 ./matrix_gather
>
> v_loc on P0: 3x2
> 0 9
> 1 10
> 2 11
>
> v_loc on P1: 3x2
> 3 12
> 4 13
> 5 14
>
> v_loc on P2: 3x2
> 6 15
> 7 16
> 8 17
>
> v_glob on P0: 9x4
> 0 9 0 0
> 1 10 0 0
> 2 11 0 0
> 0 3 12 0
> 0 4 13 0
> 0 5 14 0
> 0 0 6 15
> 0 0 7 16
> 0 0 8 17
>
> Any ideas?
>
> Thanks,
>
> Jonas
>
>
>
> --
> *J. Thies*
> Assistant Professor
>
> TU Delft
> Faculty Electrical Engineering, Mathematics and Computer Science
> Institute of Applied Mathematics and High Performance Computing Center
> Mekelweg 4
> 2628 CD Delft
>
> T +31 15 27 
> *j.th...@tudelft.nl *
>


Re: [OMPI users] MPI_ERR_TAG: invalid tag

2021-09-19 Thread George Bosilca via users
The error message is self-explanatory: the application calls MPI_Recv with
an invalid tag. The MPI standard defines a valid tag as a non-negative
integer between 0 and the value of the MPI_TAG_UB attribute on
MPI_COMM_WORLD. At this point it seems plausible this is an application
issue.

Check that the application correctly restricts the tags it uses to the
valid range. If it does, then the issue might be on the OMPI side, in
which case a reproducer would be appreciated.
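
If in doubt, the upper bound can be queried at run time, e.g.:

    int *tag_ub = NULL, flag = 0;
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub, &flag);
    if (flag)
        printf("valid tags on this MPI: 0 .. %d\n", *tag_ub);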

George.



On Sun, Sep 19, 2021 at 11:42 Feng Wade via users 
wrote:

> Hi,
>
> Good morning.
>
> I am using openmpi/4.0.3 on Compute Canada to do 3D flow simulation. My
> grid size is Lx*Ly*Lz=700*169*500. It worked quite well for lower
> resolution. However, after increasing my resolution from Nx*Ny*Nz=64*109*62
> to 256*131*192, openmpi reported errors as shown below:
>
> [gra541:21749] *** An error occurred in MPI_Recv
> [gra541:21749] *** reported by process [2068774913,140]
> [gra541:21749] *** on communicator MPI COMMUNICATOR 13 DUP FROM 0
> [gra541:21749] *** MPI_ERR_TAG: invalid tag
> [gra541:21749] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
> will now abort,
> [gra541:21749] ***and potentially your MPI job)
> [gra529:07588] 210 more processes have sent help message
> help-mpi-errors.txt / mpi_errors_are_fatal
> [gra529:07588] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
>
> This is my computation parameters and command to run openmpi:
> #!/bin/bash
> #SBATCH --time=0-10:00:00
> #SBATCH --job-name=3D_EIT_Wi64
> #SBATCH --output=log-%j
> #SBATCH --ntasks=128
> #SBATCH --nodes=4
> #SBATCH --mem-per-cpu=4000M
> mpirun ./vepoiseuilleFD_5.x
>
> I guess The value of the PATH and LD_LIBRARY_PATH environment variables
> are all set correct because my simulation worked for lower resolution ones.
>
> Thank you for your time.
>
> Sincerely
>
> Wade
>


Re: [OMPI users] Question about MPI_T

2021-08-17 Thread George Bosilca via users
You need to enable the monitoring PML in order to get access to the
pml_monitoring_messages_count MPI_T performance variable. For this you need
to know which PML you are currently using and add "monitoring" to the pml
MCA variable. As an example, if you use ob1, you should add the following to
your mpirun command: "--mca pml ob1,monitoring".

  George.


On Mon, Aug 16, 2021 at 2:06 PM Jong Choi via users <
users@lists.open-mpi.org> wrote:

> Hi.
>
> I am trying to test if I can compile and run the MPI_T test case:
> ompi/mca/common/monitoring/monitoring_prof.c
>
> But, I am getting the following error:
> cannot find monitoring MPI_T "pml_monitoring_messages_count" pvar, check
> that you have monitoring pml
>
> Should I turn on something when building openmpi? Can I get any advice on
> using MPI_T?
>
> Thanks,
> Jong
>
> --
> Jong Youl Choi
> Scientific Data Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
> Homepage: http://www.ornl.gov/~jyc/
>


Re: [OMPI users] Allreduce with Op

2021-03-13 Thread George Bosilca via users
Hi Pierre,

MPI is allowed to pipeline collective communications. This is why the
MPI_Op takes the length of the buffers as an argument. Because your MPI_Op
ignores this length, it touches data outside the temporary buffer we use for
each segment. Other MPI_Allreduce implementations might choose not to
pipeline, in which case applying the MPI_Op to the entire length of the
buffer (as you manually did in your code) happens to be correct.
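
To make this concrete, a user-defined op has to operate on exactly *len
elements every time it is invoked, because Open MPI may apply it segment by
segment. A minimal correct example (a sketch, not your exact code):

    #include <mpi.h>
    #include <stdio.h>

    /* the op must honor *len: it may be called on segments of the buffer */
    static void my_min(void *invec, void *inoutvec, int *len, MPI_Datatype *dtype) {
        double *in = (double *)invec, *inout = (double *)inoutvec;
        for (int i = 0; i < *len; i++)
            if (in[i] < inout[i]) inout[i] = in[i];
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double in[2] = { (double)rank, (double)-rank }, out[2];
        MPI_Op op;
        MPI_Op_create(my_min, 1 /* commutative */, &op);
        MPI_Allreduce(in, out, 2, MPI_DOUBLE, op, MPI_COMM_WORLD);
        if (0 == rank) printf("min = %g %g\n", out[0], out[1]);
        MPI_Op_free(&op);
        MPI_Finalize();
        return 0;
    }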

  George.


On Sat, Mar 13, 2021 at 4:47 AM Pierre Jolivet via users <
users@lists.open-mpi.org> wrote:

> Hello,
> The following piece of code generates Valgrind errors with OpenMPI 4.1.0,
> while it is Valgrind-clean with MPICH and OpenMPI 4.0.5.
> I don’t think I’m doing anything illegal, so could this be a regression
> introduced in 4.1.0?
>
> Thanks,
> Pierre
>
> $ /opt/openmpi-4.1.0/bin/mpicxx ompi.cxx -g -O0 -std=c++11
> $ /opt/openmpi-4.1.0/bin/mpirun -n 4 valgrind --log-file=dump.%p.log
> ./a.out
>
>
>
> ==528== Invalid read of size 2
> ==528==at 0x4011EB: main::{lambda(void*, void*, int*,
> ompi_datatype_t**)#1}::operator()(void*, void*, int*, ompi_datatype_t**)
> const (ompi.cxx:15)
> ==528==by 0x40127B: main::{lambda(void*, void*, int*,
> ompi_datatype_t**)#1}::_FUN(void*, void*, int*, ompi_datatype_t**)
> (ompi.cxx:19)
> ==528==by 0x48EFFED: ompi_coll_base_allreduce_intra_ring (in
> /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
> ==528==by 0x77CD93A: ompi_coll_tuned_allreduce_intra_dec_fixed (in
> /opt/openmpi-4.1.0/lib/openmpi/mca_coll_tuned.so)
> ==528==by 0x48AAF00: PMPI_Allreduce (in
> /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
> ==528==by 0x401317: main (ompi.cxx:21)
> ==528==  Address 0x7139f74 is 0 bytes after a block of size 4 alloc'd
> ==528==at 0x4839809: malloc (vg_replace_malloc.c:307)
> ==528==by 0x48EF940: ompi_coll_base_allreduce_intra_ring (in
> /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
> ==528==by 0x77CD93A: ompi_coll_tuned_allreduce_intra_dec_fixed (in
> /opt/openmpi-4.1.0/lib/openmpi/mca_coll_tuned.so)
> ==528==by 0x48AAF00: PMPI_Allreduce (in
> /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
> ==528==by 0x401317: main (ompi.cxx:21)
> ==528==
> ==528== Invalid read of size 2
> ==528==at 0x40120E: main::{lambda(void*, void*, int*,
> ompi_datatype_t**)#1}::operator()(void*, void*, int*, ompi_datatype_t**)
> const (ompi.cxx:16)
> ==528==by 0x40127B: main::{lambda(void*, void*, int*,
> ompi_datatype_t**)#1}::_FUN(void*, void*, int*, ompi_datatype_t**)
> (ompi.cxx:19)
> ==528==by 0x48EFFED: ompi_coll_base_allreduce_intra_ring (in
> /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
> ==528==by 0x77CD93A: ompi_coll_tuned_allreduce_intra_dec_fixed (in
> /opt/openmpi-4.1.0/lib/openmpi/mca_coll_tuned.so)
> ==528==by 0x48AAF00: PMPI_Allreduce (in
> /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
> ==528==by 0x401317: main (ompi.cxx:21)
> ==528==  Address 0x7139f76 is 2 bytes after a block of size 4 alloc'd
> ==528==at 0x4839809: malloc (vg_replace_malloc.c:307)
> ==528==by 0x48EF940: ompi_coll_base_allreduce_intra_ring (in
> /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
> ==528==by 0x77CD93A: ompi_coll_tuned_allreduce_intra_dec_fixed (in
> /opt/openmpi-4.1.0/lib/openmpi/mca_coll_tuned.so)
> ==528==by 0x48AAF00: PMPI_Allreduce (in
> /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
> ==528==by 0x401317: main (ompi.cxx:21)
> ==528==
> ==528== Invalid read of size 2
> ==528==at 0x401231: main::{lambda(void*, void*, int*,
> ompi_datatype_t**)#1}::operator()(void*, void*, int*, ompi_datatype_t**)
> const (ompi.cxx:18)
> ==528==by 0x40127B: main::{lambda(void*, void*, int*,
> ompi_datatype_t**)#1}::_FUN(void*, void*, int*, ompi_datatype_t**)
> (ompi.cxx:19)
> ==528==by 0x48EFFED: ompi_coll_base_allreduce_intra_ring (in
> /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
> ==528==by 0x77CD93A: ompi_coll_tuned_allreduce_intra_dec_fixed (in
> /opt/openmpi-4.1.0/lib/openmpi/mca_coll_tuned.so)
> ==528==by 0x48AAF00: PMPI_Allreduce (in
> /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
> ==528==by 0x401317: main (ompi.cxx:21)
> ==528==  Address 0x7139f78 is 4 bytes after a block of size 4 alloc'd
> ==528==at 0x4839809: malloc (vg_replace_malloc.c:307)
> ==528==by 0x48EF940: ompi_coll_base_allreduce_intra_ring (in
> /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
> ==528==by 0x77CD93A: ompi_coll_tuned_allreduce_intra_dec_fixed (in
> /opt/openmpi-4.1.0/lib/openmpi/mca_coll_tuned.so)
> ==528==by 0x48AAF00: PMPI_Allreduce (in
> /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
> ==528==by 0x401317: main (ompi.cxx:21)


Re: [OMPI users] AVX errors building OpenMPI 4.1.0

2021-02-05 Thread George Bosilca via users
Carl,

AVX support was introduced in 4.1, which explains why you did not see such
issues before. What is your configure command in these two cases? Please
create an issue on GitHub and attach your config.log.

  George.



On Fri, Feb 5, 2021 at 2:44 PM Carl Ponder via users <
users@lists.open-mpi.org> wrote:

> Building OpenMPI 4.1.0 with the PGI 21.1 compiler on a Broadwell
> processor, I get this error
>
> libtool: compile:  pgcc -DHAVE_CONFIG_H -I. -I../../../../opal/include
> -I../../../../ompi/include -I../../../../oshmem/include
> -I../../../../opal/mca/hwloc/hwloc201/hwloc/include/private/autogen
> -I../../../../opal/mca/hwloc/hwloc201/hwloc/include/hwloc/autogen
> -I../../../../ompi/mpiext/cuda/c -DGENERATE_SSE3_CODE -DGENERATE_SSE41_CODE
> -DGENERATE_AVX_CODE -DGENERATE_AVX2_CODE -DGENERATE_AVX512_CODE
> -I../../../.. -I../../../../orte/include
> -I/gpfs/fs1/SHARE/Utils/PGI/21.1/Linux_x86_64/21.1/cuda/11.2/include
> -DUCS_S_PACKED= -I/gpfs/fs1/SHARE/Utils/ZLib/1.2.11/PGI-21.1/include
> -I/gpfs/fs1/SHARE/Utils/HWLoc/2.4.0/PGI-21.1_CUDA-11.2.0.0_460.27.04/include
> -I/usr/local/include -I/usr/local/include -march=skylake-avx512 -O3
> -DNDEBUG -m64 -tp=px -Mnodalign -fno-strict-aliasing -c op_avx_functions.c
> -MD  -fPIC -DPIC -o .libs/liblocal_ops_avx512_la-op_avx_functions.o
> LLVM ERROR: Cannot select: intrinsic %llvm.x86.sse3.ldu.dq
> Makefile:1993: recipe for target
> 'liblocal_ops_avx512_la-op_avx_functions.lo' failed
> make[2]: *** [liblocal_ops_avx512_la-op_avx_functions.lo] Error 1
> make[2]: Leaving directory
> '/gpfs/fs1/SHARE/Utils/OpenMPI/4.1.0/PGI-21.1_CUDA-11.2.0.0_460.27.04_UCX-1.10.0-rc2_HWLoc-2.4.0_ZLib-1.2.11/distro/ompi/mca/op/avx'
>
> and the GCC 10.2.0 compiler gives me errors like this:
>
> op_avx_functions.c: In function ‘ompi_op_avx_2buff_bxor_uint64_t_avx512’:
> op_avx_functions.c:208:21: warning: AVX512F vector return without AVX512F
> enabled changes the ABI [-Wpsabi]
>   208 | __m512i vecA =
> _mm512_loadu_si512((__m512i*)in);   \
>   | ^~~~
> op_avx_functions.c:263:5: note: in expansion of macro
> ‘OP_AVX_AVX512_BIT_FUNC’
>   263 | OP_AVX_AVX512_BIT_FUNC(name, type_size, type,
> op);  \
>   | ^~
> op_avx_functions.c:573:5: note: in expansion of macro ‘OP_AVX_BIT_FUNC’
>   573 | OP_AVX_BIT_FUNC(bxor, 64, uint64_t, xor)
>   | ^~~
> In file included from
> /gpfs/fs1/SHARE/Utils/GCC/10.2.0/GCC-BASE-7.5.0_GMP-6.2.1_ISL-0.23_MPFR-4.1.0_MPC-1.2.1_CUDA-11.2.0.0_460.27.04/lib/gcc/x86_64-pc-linux-gnu/10.2.0/include/immintrin.h:55,
>  from op_avx_functions.c:26:
> op_avx_functions.c: In function ‘ompi_op_avx_2buff_max_int8_t_avx512’:
> /gpfs/fs1/SHARE/Utils/GCC/10.2.0/GCC-BASE-7.5.0_GMP-6.2.1_ISL-0.23_MPFR-4.1.0_MPC-1.2.1_CUDA-11.2.0.0_460.27.04/lib/gcc/x86_64-pc-linux-gnu/10.2.0/include/avx512fintrin.h:6429:1:
> error: inlining failed in call
> to ‘always_inline’ ‘_mm512_storeu_si512’: target specific option mismatch
>
> I can get 4.1.0 to build with GCC by removing these flags
>
> -march=corei7-avx -mtune=corei7-avx
>
> and PGI by removing this flag
>
> -tp=px
>
> I didn't have these issues with the OpenMPI 4.0.4 source. Is there a bug
> in the 4.1.0?
>


Re: [OMPI users] Timeout in MPI_Bcast/MPI_Barrier?

2021-01-11 Thread George Bosilca via users
MPI_ERR_PROC_FAILED is not yet a valid error in MPI. It is coming from
ULFM, an extension to MPI that is not yet in the OMPI master.

Daniel, what version of Open MPI are you using? Are you sure you are not
mixing multiple versions due to PATH/LD_LIBRARY_PATH?

  George.


On Mon, Jan 11, 2021 at 21:31 Gilles Gouaillardet via users <
users@lists.open-mpi.org> wrote:

> Daniel,
>
> the test works in my environment (1 node, 32 GB memory) with all the
> mentioned parameters.
>
> Did you check the memory usage on your nodes and made sure the oom
> killer did not shoot any process?
>
> Cheers,
>
> Gilles
>
> On Tue, Jan 12, 2021 at 1:48 AM Daniel Torres via users
>  wrote:
> >
> > Hi.
> >
> > Thanks for responding. I have taken the most important parts from my
> code and I created a test that reproduces the behavior I described
> previously.
> >
> > I attach to this e-mail the compressed file "test.tar.gz". Inside him,
> you can find:
> >
> > 1.- The .c source code "test.c", which I compiled with "mpicc -g -O3
> test.c -o test -lm". The main work is performed on the function
> "work_on_grid", starting at line 162.
> > 2.- Four execution examples in two different machines (my own and a
> cluster machine), which I executed with "mpiexec -np 16 --machinefile
> hostfile --map-by node --mca btl tcp,vader,self --mca btl_base_verbose 100
> ./test 4096 4096", varying the last two arguments with 4096, 8192 and 16384
> (a matrix size). The error appears with bigger numbers (8192 in my machine,
> 16384 in the cluster)
> > 3.- The "ompi_info -a" output from the two machines.
> > 4.- The hostfile.
> >
> > The duration of the delay is just a few seconds, about 3 ~ 4.
> >
> > Essentially, the first error message I get from a waiting process is
> "74: MPI_ERR_PROC_FAILED: Process Failure".
> >
> > Hope this information can help.
> >
> > Thanks a lot for your time.
> >
> > El 08/01/21 a las 18:40, George Bosilca via users escribió:
> >
> > Daniel,
> >
> > There are no timeouts in OMPI with the exception of the initial
> connection over TCP, where we use the socket timeout to prevent deadlocks.
> As you already did quite a few communicator duplications and other
> collective communications before you see the timeout, we need more info
> about this. As Gilles indicated, having the complete output might help.
> What is the duration of the delay for the waiting process? Also, can you
> post a reproducer of this issue?
> >
> >   George.
> >
> >
> > On Fri, Jan 8, 2021 at 9:03 AM Gilles Gouaillardet via users <
> users@lists.open-mpi.org> wrote:
> >>
> >> Daniel,
> >>
> >> Can you please post the full error message and share a reproducer for
> >> this issue?
> >>
> >> Cheers,
> >>
> >> Gilles
> >>
> >> On Fri, Jan 8, 2021 at 10:25 PM Daniel Torres via users
> >>  wrote:
> >> >
> >> > Hi all.
> >> >
> >> > Actually I'm implementing an algorithm that creates a process grid
> and divides it into row and column communicators as follows:
> >> >
> >> >              col_comm0    col_comm1    col_comm2    col_comm3
> >> > row_comm0    P0           P1           P2           P3
> >> > row_comm1    P4           P5           P6           P7
> >> > row_comm2    P8           P9           P10          P11
> >> > row_comm3    P12          P13          P14          P15
> >> >
> >> > Then, every process works on its own column communicator and
> broadcast data on row communicators.
> >> > While column operations are being executed, processes not included in
> the current column communicator just wait for results.
> >> >
> >> > At some point, a column communicator may be split to create a temporary
> communicator and allow only the right processes to work on it.
> >> >
> >> > At the end of a step, a call to MPI_Barrier (on a duplicate of
> MPI_COMM_WORLD) is executed to sync all processes and avoid bad results.
> >> >
> >> > With a small amount of data (a small matrix) the MPI_Barrier call
> syncs correctly on the communicator that includes all processes and
> processing ends fine.
> >> > But when the amount of data (a big matrix) is incremented, operations
> on column communicators take more time to finish and hence waiting time
> also increments for waiting processes.
> >> >
> >> > After some time, waiting

Re: [OMPI users] Timeout in MPI_Bcast/MPI_Barrier?

2021-01-08 Thread George Bosilca via users
Daniel,

There are no timeouts in OMPI with the exception of the initial connection
over TCP, where we use the socket timeout to prevent deadlocks. As you
already did quite a few communicator duplications and other collective
communications before you see the timeout, we need more info about this. As
Gilles indicated, having the complete output might help. What is the
duration of the delay for the waiting process? Also, can you post a
reproducer of this issue?

  George.


On Fri, Jan 8, 2021 at 9:03 AM Gilles Gouaillardet via users <
users@lists.open-mpi.org> wrote:

> Daniel,
>
> Can you please post the full error message and share a reproducer for
> this issue?
>
> Cheers,
>
> Gilles
>
> On Fri, Jan 8, 2021 at 10:25 PM Daniel Torres via users
>  wrote:
> >
> > Hi all.
> >
> > Actually I'm implementing an algorithm that creates a process grid and
> divides it into row and column communicators as follows:
> >
> >              col_comm0    col_comm1    col_comm2    col_comm3
> > row_comm0    P0           P1           P2           P3
> > row_comm1    P4           P5           P6           P7
> > row_comm2    P8           P9           P10          P11
> > row_comm3    P12          P13          P14          P15
> >
> > Then, every process works on its own column communicator and broadcast
> data on row communicators.
> > While column operations are being executed, processes not included in
> the current column communicator just wait for results.
> >
> > At some point, a column communicator may be split to create a temporary
> communicator and allow only the right processes to work on it.
> >
> > At the end of a step, a call to MPI_Barrier (on a duplicate of
> MPI_COMM_WORLD) is executed to sync all processes and avoid bad results.
> >
> > With a small amount of data (a small matrix) the MPI_Barrier call syncs
> correctly on the communicator that includes all processes and processing
> ends fine.
> > But when the amount of data (a big matrix) is incremented, operations on
> column communicators take more time to finish and hence waiting time also
> increments for waiting processes.
> >
> > After some time, waiting processes return an error when they have not
> received the broadcast (MPI_Bcast) on row communicators or when they have
> finished their work at the sync point (MPI_Barrier). But when the
> operations on the current column communicator end, the still active
> processes try to broadcast on row communicators and they fail because the
> waiting processes have returned an error. So all processes fail in
> different moment in time.
> >
> > So my problem is that waiting processes "believe" that the current
> operations have failed (but they have not finished yet!) and they fail too.
> >
> > So I have a question about MPI_Bcast/MPI_Barrier:
> >
> > Is there a way to increment the timeout a process can wait for a
> broadcast or barrier to be completed?
> >
> > Here is my machine and OpenMPI info:
> > - OpenMPI version: Open MPI 4.1.0u1a1
> > - OS: Linux Daniel 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15 10:57:00
> UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
> >
> > Thanks in advance for reading my description/question.
> >
> > Best regards.
> >
> > --
> > Daniel Torres
> > LIPN - Université Sorbonne Paris Nord
>
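
For readers following the decomposition described above, a minimal sketch of
how such a 4x4 grid is typically split into row and column communicators with
MPI_Comm_split (the function and variable names are illustrative, not taken
from Daniel's code):

#include <mpi.h>

/* Build the row/column communicators for a 4x4 process grid as in the diagram. */
void build_grid_comms(MPI_Comm *row_comm, MPI_Comm *col_comm, MPI_Comm *world_dup)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int row = rank / 4;   /* P0-P3 -> row 0, P4-P7 -> row 1, ... */
    int col = rank % 4;
    MPI_Comm_split(MPI_COMM_WORLD, row, col, row_comm);   /* same color = same row    */
    MPI_Comm_split(MPI_COMM_WORLD, col, row, col_comm);   /* same color = same column */
    MPI_Comm_dup(MPI_COMM_WORLD, world_dup);              /* duplicate for the global MPI_Barrier */
}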


Re: [OMPI users] MPI_type_free question

2020-12-04 Thread George Bosilca via users
On Fri, Dec 4, 2020 at 2:33 AM Patrick Bégou via users <
users@lists.open-mpi.org> wrote:

> Hi George and Gilles,
>
> Thanks George for your suggestion. Does it apply to the 4.0.5 and 3.1 OpenMPI
> versions? I will have a look today at these tables. Maybe I will write a small
> piece of code just creating and freeing subarray datatypes.
>

Patrick,

Using Gilles' suggestion to go through the type_f2c function when listing
the datatypes should give you a portable datatype iterator across all
versions of OMPI. The call to dump a datatype's content,
ompi_datatype_dump, has been there for a very long time, so the combination
of the two should work everywhere.

Thinking a little more about this, you don't necessarily have to dump the
content of the datatype, you only need to check if they are different from
MPI_DATATYPE_NULL. Thus, you can have a solution using only the MPI API.

  George.
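
A minimal sketch of that MPI-API-only check: walk the Fortran handle space
with MPI_Type_f2c and count the indices that still refer to a datatype. The
upper bound (10000) is an arbitrary assumption, and treating an unused index
as either an invalid (NULL) handle or MPI_DATATYPE_NULL is an Open MPI
specific detail, as discussed in this thread:

#include <mpi.h>

static int count_live_datatypes(void)
{
    int live = 0;
    for (int i = 0; i < 10000; i++) {
        MPI_Datatype t = MPI_Type_f2c((MPI_Fint)i);
        if (t && t != MPI_DATATYPE_NULL)   /* skip unused/freed indices */
            live++;
    }
    return live;
}

/* Call it before and after the create/free loop: if the count (which also
 * includes the predefined datatypes) keeps growing, handles are leaking. */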


>
> Thanks Gilles for suggesting disabling the interconnect. it is a good fast
> test and yes, *with "mpirun --mca pml ob1 --mca btl tcp,self" I have no
> memory leak*. So this explain the differences between my laptop and the
> cluster.
> Is the implementation of type management so different from 1.7.3?
>
> A PhD student tells me he also has some trouble with this code on an
> Omni-Path-based cluster. I will have to investigate too, but I am not sure
> it is the same problem.
>
> Patrick
>
> On 04/12/2020 at 01:34, Gilles Gouaillardet via users wrote:
>
> Patrick,
>
>
> based on George's idea, a simpler check is to retrieve the Fortran index
> via the (standard) MPI_Type_c2f() function
>
> after you create a derived datatype.
>
>
> If the index keeps growing forever even after you MPI_Type_free(), then
> this clearly indicates a leak.
>
> Unfortunately, this simple test cannot be used to definitely rule out any
> memory leak.
>
>
> Note you can also
>
> mpirun --mca pml ob1 --mca btl tcp,self ...
>
> in order to force communications over TCP/IP and hence rule out any memory
> leak that could be triggered by your fast interconnect.
>
>
>
> In any case, a reproducer will greatly help us debugging this issue.
>
>
> Cheers,
>
>
> Gilles
>
>
>
> On 12/4/2020 7:20 AM, George Bosilca via users wrote:
>
> Patrick,
>
> I'm afraid there is no simple way to check this. The main reason being
> that OMPI uses handles for MPI objects, and these handles are not tracked by
> the library, they are supposed to be provided by the user for each call. In
> your case, as you already called MPI_Type_free on the datatype, you cannot
> produce a valid handle.
>
> There might be a trick. If the datatype is manipulated with any Fortran
> MPI functions, then we convert the handle (which in fact is a pointer) to
> an index into a pointer array structure. Thus, the index will remain used,
> and can therefore be used to convert back into a valid datatype pointer,
> until OMPI completely releases the datatype. Look into
> the ompi_datatype_f_to_c_table table to see the datatypes that exist and
> get their pointers, and then use these pointers as arguments to
> ompi_datatype_dump() to see if any of these existing datatypes are the ones
> you define.
>
> George.
>
>
>
>
> On Thu, Dec 3, 2020 at 4:44 PM Patrick Bégou via users <
> users@lists.open-mpi.org> wrote:
>
> Hi,
>
> I'm trying to solve a memory leak since my new implementation of
> communications based on MPI_Alltoallw and MPI_Type_create_subarray
> calls.  Arrays of subarray types are created/destroyed at each
> time step and used for communications.
>
> On my laptop the code runs fine (running for 15000 temporal
> iterations on 32 processes with oversubscription) but on our
> cluster memory used by the code increases until the OOM killer stops
> the job. On the cluster we use IB QDR for communications.
>
> Same Gcc/Gfortran 7.3 (built from sources), same sources of
> OpenMPI (3.1 or 4.0.5 tested), same sources of the fortran code on
> the laptop and on the cluster.
>
> Using Gcc/Gfortran 4.8 and OpenMPI 1.7.3 on the cluster does not
> show the problem (resident memory does not increase and we ran
> 10 temporal iterations)
>
> The MPI_Type_free manual says that it "Marks the datatype object
> associated with datatype for deallocation". But how can I check
> that the deallocation is really done?
>
> Thanks for any suggestions.
>
> Patrick
>
>
>


Re: [OMPI users] MPI_type_free question

2020-12-03 Thread George Bosilca via users
Patrick,

I'm afraid there is no simple way to check this. The main reason being that
OMPI uses handles for MPI objects, and these handles are not tracked by the
library, they are supposed to be provided by the user for each call. In
your case, as you already called MPI_Type_free on the datatype, you cannot
produce a valid handle.

There might be a trick. If the datatype is manipulated with any Fortran MPI
functions, then we convert the handle (which in fact is a pointer) to an
index into a pointer array structure. Thus, the index will remain used, and
can therefore be used to convert back into a valid datatype pointer, until
OMPI completely releases the datatype. Look into
the ompi_datatype_f_to_c_table table to see the datatypes that exist and
get their pointers, and then use these pointers as arguments to
ompi_datatype_dump() to see if any of these existing datatypes are the ones
you define.

George.




On Thu, Dec 3, 2020 at 4:44 PM Patrick Bégou via users <
users@lists.open-mpi.org> wrote:

> Hi,
>
> I'm trying to solve a memory leak since my new implementation of
> communications based on MPI_Alltoallw and MPI_Type_create_subarray calls.
> Arrays of SubArray types are created/destroyed at each time step and used
> for communications.
>
> On my laptop the code runs fine (running for 15000 temporal iterations on
> 32 processes with oversubscription) but on our cluster memory used by the
> code increases until the OOM killer stops the job. On the cluster we use IB
> QDR for communications.
>
> Same Gcc/Gfortran 7.3 (built from sources), same sources of OpenMPI (3.1
> or 4.0.5 tested), same sources of the fortran code on the laptop and on the
> cluster.
>
> Using Gcc/Gfortran 4.8 and OpenMPI 1.7.3 on the cluster does not show the
> problem (resident memory does not increase and we ran 10 temporal
> iterations)
>
> The MPI_Type_free manual says that it "Marks the datatype object associated
> with datatype for deallocation". But how can I check that the
> deallocation is really done?
>
> Thanks for any suggestions.
>
> Patrick
>


Re: [OMPI users] Vader - Where to Look for Shared Memory Use

2020-07-22 Thread George Bosilca via users
John,

There are many things in play in such an experiment. Plus, expecting linear
speedup even at the node level is certainly overly optimistic.

1. A single core experiment has full memory bandwidth, so you will
asymptotically reach the max flops. Adding more cores will increase the
memory pressure, and at some point the memory will not be able to deliver,
and will become the limiting factor (not the computation capabilities of
the cores).

2. HPL communication pattern is composed of 3 types of messages. 1 element
in the panel (column) in the context of an allreduce (to find the max),
medium size (a decreasing multiple of NB as you progress in the
computation) for the swap operation, and finally some large messages of
(NB*NB*sizeof(elem)) for the update. All this to say that CMA_SIZE_MBYTES=5
should be more than enough for you.

Have fun,
  George.



On Wed, Jul 22, 2020 at 2:19 PM John Duffy via users <
users@lists.open-mpi.org> wrote:

> Hi Joseph, John
>
> Thank you for your replies.
>
> I’m using Ubuntu 20.04 aarch64 on a 8 x Raspberry Pi 4 cluster.
>
> The symptoms I’m experiencing are that the HPL Linpack performance in
> Gflops increases on a single core as NB is increased from 32 to 256. The
> theoretical maximum is 6 Gflops per core. I can achieve 4.8 Gflops, which I
> think is a reasonable expectation. However, as I add more cores on a single
> node, 2, 3 and finally 4 cores, the performance scaling is nowhere near
> linear, and tails off dramatically as NB is increased. I can achieve 15
> Gflops on a single node of 4 cores, whereas the theoretical maximum is 24
> Gflops per node.
>
> ompi_info suggests vader is available/working…
>
>  MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>  MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>  MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>  MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>
> I’m wondering whether the Ubuntu kernel CMA_SIZE_MBYTES=5 is limiting
> Open-MPI message number/size. So, I’m currently building a new kernel with
> CMA_SIZE_MBYTES=16.
>
> I have attached 2 plots from my experiments…
>
> Plot 1 - shows an increase in Gflops for 1 core as NB increases, up to a
> maximum value of 4.75 Gflops when NB = 240.
>
> Plot 2 - shows an increase in Gflops for 4 x cores (all on same the same
> node) as NB increases. The maximum Gflops achieved is 15 Gflops. I would
> hope that rather than drop off dramatically at NB = 168, the performance
> would trend upwards towards somewhere near 4 x 4.75 = 19 Gflops.
>
> This is why I wondering whether Open-MPI messages via vader are being
> hampered by a limiting CMA size.
>
> Lets see what happens with my new kernel...
>
> Best regards
>
> John
>
>
>


Re: [OMPI users] Error with MPI_GET_ADDRESS and MPI_TYPE_CREATE_RESIZED?

2020-05-17 Thread George Bosilca via users
Diego,

I see nothing wrong with the way you create the datatype. In fact this is
the perfect example of how to almost do it right in FORTRAN. The "almost" is
because your code is highly dependent on the -r8 compiler option (otherwise
the REAL in your type will not match the MPI_DOUBLE_PRECISION you provide
to MPI_Type_create_struct).

Btw you can remove the MPI_Type_commit on the first datatype, you only need
to commit types that will be used in communications (not anything that is
temporarily used to build other types).

George.


On Sun, May 17, 2020 at 11:19 AM Diego Avesani via users <
users@lists.open-mpi.org> wrote:

> Dear Gilles, dear All,
>
> As far as I remember, no. The compiler is the same, as are the options I use.
>
> Maybe the error is in some other place of my code. However, the results
> look like errors in the allocation of the sent and received vectors of the
> datatype.
>
> The important thing is that at least my data type definition is correct.
>
> These are the options used in compiling:
>
> -c -O2 -r8 -align -lmkl_blas95_lp64 -lmkl_lapack95_lp64 -mkl=sequential
> -fpp -CB
>
> Best regards
>
>
> Diego
>
>
>
> On Sun, 17 May 2020 at 15:27, Gilles Gouaillardet via users <
> users@lists.open-mpi.org> wrote:
>
>> Diego,
>>
>> Did you change your compiler options?
>>
>> Cheers,
>>
>> Gilles
>>
>> - Original Message -
>> > Dear all,
>> >
>> > I would like to share with what I have done in oder to create my own
>> MPI
>> > data type. The strange thing is that it worked until some day ago and
>> then
>> > it stopped working. This because probably I have changed my data type
>> and I
>> > miss some knowledge about MPI data type
>> >
>> > This is my data type:
>> > **
>> >   TYPE tParticle
>> > SEQUENCE
>> > INTEGER  :: ip
>> > INTEGER :: mc
>> > INTEGER :: bcflag
>> > INTEGER :: cpu
>> > REAL   :: RP(di)
>> > REAL   :: QQ(nVar)
>> > REAL   :: UU0(2*(1+di+nVar))
>> > REAL   :: lij
>> > REAL   :: lmls
>> > REAL   :: vol
>> > REAL   :: mm
>> >   ENDTYPE tParticle
>> > **
>> >
>> > Then I have this variables in order to create mine:
>> >
>> > **
>> >   INTEGER,   ALLOCATABLE,DIMENSION(:)   :: TYPES,
>> LENGTHS
>> >   INTEGER(MPI_ADDRESS_KIND), ALLOCATABLE,DIMENSION(:)   ::
>> DISPLACEMENTS
>> >   TYPE(tParticle) :: dummy(2)
>> > **
>> >
>> > Having 5 structures:  Nstruct=5
>> >
>> > In the following how I create my MPI data TYPE
>> >
>> > **
>> > ALLOCATE(TYPES  (Nstruct))
>> > ALLOCATE(LENGTHS(Nstruct))
>> > ALLOCATE(DISPLACEMENTS  (0:nstruct+1))
>> > !set the types
>> > TYPES(1)   = MPI_INTEGER
>> > TYPES(2)   = MPI_DOUBLE_PRECISION
>> > TYPES(3)   = MPI_DOUBLE_PRECISION
>> > TYPES(4)   = MPI_DOUBLE_PRECISION
>> > TYPES(5)   = MPI_DOUBLE_PRECISION
>> > !set the lengths
>> > LENGTHS(1) = 4
>> > LENGTHS(2) = SIZE(dummy(1)%RP)
>> > LENGTHS(3) = SIZE(dummy(1)%QQ)
>> > LENGTHS(4) = SIZE(dummy(1)%UU0)
>> > LENGTHS(5) = 4
>> >   !
>> >   CALL MPI_GET_ADDRESS(dummy(1),DISPLACEMENTS(0),MPIdata%iErr)
>> >   CALL MPI_GET_ADDRESS(dummy(1)%ip ,DISPLACEMENTS(1),MPIdata%iErr)
>> >   CALL MPI_GET_ADDRESS(dummy(1)%RP(1)  ,DISPLACEMENTS(2),MPIdata%iErr)
>> >   CALL MPI_GET_ADDRESS(dummy(1)%QQ(1)  ,DISPLACEMENTS(3),MPIdata%iErr)
>> >   CALL MPI_GET_ADDRESS(dummy(1)%UU0(1) ,DISPLACEMENTS(4),MPIdata%iErr)
>> >   CALL MPI_GET_ADDRESS(dummy(1)%lij,DISPLACEMENTS(5),MPIdata%iErr)
>> >   CALL MPI_GET_ADDRESS(dummy(2),DISPLACEMENTS(6),MPIdata%iErr)
>> >   !
>> >   DISPLACEMENTS(1:nstruct+1)= DISPLACEMENTS(1:nstruct+1)-DISPLACEMENTS
>> (0)
>> >   !
>> >   CALL
>> > MPI_TYPE_CREATE_STRUCT(nstruct,lengths,DISPLACEMENTS(1:nstruct+1),
>> types,MPI_PARTICLE_TYPE_OLD,MPIdata%iErr)
>> >   CALL MPI_TYPE_COMMIT(MPI_PARTICLE_TYPE_OLD,MPIdata%iErr)
>> >   !
>> >   CALL MPI_TYPE_CREATE_RESIZED(MPI_PARTICLE_TYPE_OLD, DISPLACEMENTS(1),
>> > DISPLACEMENTS(6), MPI_PARTICLE_TYPE, MPIdata%iErr)
>> >   CALL MPI_TYPE_COMMIT(MPI_PARTICLE_TYPE,MPIdata%iErr)
>> >
>> >
>> > Do you see something wrong, maybe related to the DISPLACEMENTS? This is
>> > because, as I already told you, something has happened after I just
>> > added
>> > "REAL   :: mm  "
>> >  in my type and consequently set "LENGTHS(5) = 4 ".
>> >
>> > What do you think?
>> > Thanks in advance for any kind of help.
>> >
>> > Best,
>> >
>> > Diego
>> >
>>
>>
>>


Re: [OMPI users] Regarding eager limit relationship to send message size

2020-03-26 Thread George Bosilca via users
An application that relies on MPI eager buffers for correctness or
performance is an incorrect application, among many other reasons simply
because MPI implementations without eager support are legitimate. Moreover,
these applications also miss the point on performance. Among the overheads
I am not only talking about the memory allocations by MPI to store the
eager data, or the additional memcpy needed to put that data back into
userland once the corresponding request is posted. But also about stressing
the unexpected messages path in the MPI library, creating potentially long
chains of unexpected messages that need to be traversed in order to
guarantee the FIFO matching required by MPI.

In the same idea as Jeff, if you want a portable and efficient MPI
application then assume eager is always 0 and prepost all your receives.

  George.

PS: In OMPI the eager size is provided by the underlying transport, i.e. the
BTL, and can be changed via MCA. 'ompi_info --param btl all -l 4 |
grep eager' should give you the full list.
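
A minimal sketch of the "assume eager is always 0 and prepost all your
receives" advice for a pairwise exchange like the one discussed here (buffer
names, counts and datatypes are placeholders):

/* Post every receive before any send, then wait on everything:
 * correct regardless of the eager limit of the underlying BTL. */
MPI_Request reqs[2 * (nprocs - 1)];
int n = 0;
for (int p = 0; p < nprocs; p++)
    if (p != rank)
        MPI_Irecv(recvbuf[p], count, MPI_DOUBLE, p, 0, comm, &reqs[n++]);
for (int p = 0; p < nprocs; p++)
    if (p != rank)
        MPI_Isend(sendbuf[p], count, MPI_DOUBLE, p, 0, comm, &reqs[n++]);
MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);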

On Thu, Mar 26, 2020 at 10:00 AM Jeff Squyres (jsquyres) 
wrote:

> On Mar 26, 2020, at 5:36 AM, Raut, S Biplab  wrote:
> >
> > I am doing pairwise send-recv and not all-to-all since not all the data
> is required by all the ranks.
> > And I am doing blocking send and recv calls since there are multiple
> iterations of such message chunks to be sent with synchronization.
> >
> > I understand your recommendation in the below mail, however I still see
> benefit for my application level algorithm to do pairwise send-recv chunks
> where each chunk is within eager limit.
> > Since the input and output buffer is same within the process, so I can
> avoid certain buffering at each sender rank by doing successive send calls
> within eager limit to receiver ranks and then have recv calls.
>
> But if the buffers are small enough to fall within the eager limit,
> there's very little benefit to not having an A/B buffering scheme.  Sure,
> it's 2x the memory, but it's 2 times a small number (measured in KB).
> Assuming you have GB of RAM, it's hard to believe that this would make a
> meaningful difference.  Indeed, one way to think of the eager limit is:
> "it's small enough that the cost of a memcpy doesn't matter."
>
> I'm not sure I understand your comments about preventing copying.  MPI
> will always do the most efficient thing to send the message, regardless of
> whether it is under the eager limit or not.  I also don't quite grok your
> comments about "application buffering" and message buffering required by
> the eager protocol.
>
> The short version of this is: you shouldn't worry about any of this.  Rely
> on the underlying MPI to do the most efficient thing possible, and you
> should use a communication algorithm that makes sense for your
> application.  In most cases, you'll be good.
>
> If you start trying to tune for a specific environment, platform, and MPI
> implementation, the number of variables grows exponentially.  And if you
> change any one parameter in the whole setup, your optimizations may get
> lost.  Also, if you add a bunch of infrastructure in your app to try to
> exactly match your environment+platform+implementation (e.g., manual
> segmenting to fit your overall message into the eager limit), you may just
> be adding additional overhead that effectively nullifies any optimization
> you might get (especially if the optimization is very small).  Indeed, the
> methods used for shared memory and similar to but different than the
> methods used for networks.  And there's a wide variety of network
> capabilities; some can be more efficient than others (depending on a
> zillion factors).
>
> If you're using shared memory, ensure that your Linux kernel has good
> shared memory support (e.g., support for CMA), and let MPI optimize the
> message transfers for you.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
>


Re: [OMPI users] Regarding eager limit relationship to send message size

2020-03-25 Thread George Bosilca via users
On Wed, Mar 25, 2020 at 4:49 AM Raut, S Biplab  wrote:

> [AMD Official Use Only - Internal Distribution Only]
>
>
>
> Dear George,
>
> Thank you for the reply. But my question is more
> specifically about the message size from the application side.
>
>
>
> Let’s say the application is running with 128 ranks.
>
> Each rank is doing send() msg to rest of 127 ranks where the msg length
> sent is under question.
>
> Now after all the sends are completed, each rank will recv() msg from rest
> of 127 ranks.
>
> Unless the msg length in the sending part is within eager_limit (4K size),
> this program will hang.
>

This is definitely not true: one can imagine many communication patterns
that will ensure correctness for your all-to-all communications. As an
example, you can place your processes in a virtual ring, and at each step
send and recv to/from process (my_rank + step) % comm_size. This
communication pattern will always be correct, independent of the eager size
(for as long as you correctly order the send/recv for each pair).
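
One concrete realization of such a ring-ordered exchange (a sketch: at step
s, send to (my_rank + s) % comm_size and receive from
(my_rank - s + comm_size) % comm_size; MPI_Sendrecv pairs the two operations
so each step is deadlock-free whatever the eager size; buffer names are
placeholders):

for (int step = 1; step < comm_size; step++) {
    int to   = (my_rank + step) % comm_size;
    int from = (my_rank - step + comm_size) % comm_size;
    MPI_Sendrecv(sendbuf[to],   count, MPI_DOUBLE, to,   0,
                 recvbuf[from], count, MPI_DOUBLE, from, 0,
                 comm, MPI_STATUS_IGNORE);
}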

 So, based on the above scenario, my questions are:-
>

>1. Can each of the rank send message upto 4K size successfully, i.e
>all 128 ranks sending (128 * 4K) bytes simultaneously?
>
> Potentially yes, but there are physical constraints (e.g., the number of
network links, switch capabilities, ...) and memory limits. But if you have
enough memory, this could potentially work. I'm not saying this is correct
and should be done.


>1. If application has bigger msg to be sent by each rank, then how to
>derive the send message size? Is it equal to eager_limit and each rank
>needs to send multiple chunks of this size?
>
> Definitely not! You should never rely on the eager size to fix a complex
communication pattern. The rule of thumb should be: is my application
working correctly if the MPI library forces a zero-byte eager size? As suggested
above, the most suitable approach is to define a communication scheme that
would never deadlock.

  George.


> With Regards,
>
> S. Biplab Raut
>
>
>
> *From:* George Bosilca 
> *Sent:* Tuesday, March 24, 2020 9:01 PM
> *To:* Open MPI Users 
> *Cc:* Raut, S Biplab 
> *Subject:* Re: [OMPI users] Regarding eager limit relationship to send
> message size
>
>
>
> [CAUTION: External Email]
>
> Biplab,
>
>
>
> The eager size is a constant for each BTL, and it represents the data that is
> sent eagerly with the matching information out of the entire message. So,
> if the question is how much memory is needed to store all the
> eager messages then the answer will depend on the communication pattern of
> your application:
>
> - applications using only blocking messages might only have 1 pending
> communications per peer, so in the worst case any process will only need at
> most P * eager_size memory for local storage of the eager.
>
> - applications using non-blocking communications, there is basically no
> limit.
>
>
>
> However, the good news is that you can change this limit to adapt to the
> needs of your application(s).
>
>
>
> Hope this answers your question,
>
> George.
>
>
>
>
>
> On Tue, Mar 24, 2020 at 1:46 AM Raut, S Biplab via users <
> users@lists.open-mpi.org> wrote:
>
> Dear Experts,
>
> I would like to derive/calculate the maximum MPI
> send message size possible  given the known details of
> btl_vader_eager_limit and number of ranks.
>
> Can anybody explain and confirm on this?
>
>
>
> With Regards,
>
> S. Biplab Raut
>
>


Re: [OMPI users] Regarding eager limit relationship to send message size

2020-03-24 Thread George Bosilca via users
Biplab,

The eager size is a constant for each BTL, and it represents the data that is
sent eagerly with the matching information out of the entire message. So,
if the question is how much memory is needed to store all the
eager messages then the answer will depend on the communication pattern of
your application:
- applications using only blocking messages might only have 1 pending
communication per peer, so in the worst case any process will only need at
most P * eager_size memory for local storage of the eager data.
- applications using non-blocking communications, there is basically no
limit.

However, the good news is that you can change this limit to adapt to the
needs of your application(s).

Hope this answers your question,
George.
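
As a worked example of the worst-case bound above (the numbers are purely
illustrative):

/* Blocking sends only: at most one pending eager message per peer. */
size_t peers      = 128;                  /* roughly P, the communicator size  */
size_t eager_size = 4 * 1024;             /* e.g. a 4 KiB BTL eager limit      */
size_t worst_case = peers * eager_size;   /* 512 KiB of eager storage per rank */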


On Tue, Mar 24, 2020 at 1:46 AM Raut, S Biplab via users <
users@lists.open-mpi.org> wrote:

> Dear Experts,
>
> I would like to derive/calculate the maximum MPI
> send message size possible  given the known details of
> btl_vader_eager_limit and number of ranks.
>
> Can anybody explain and confirm on this?
>
>
>
> With Regards,
>
> S. Biplab Raut
>


Re: [OMPI users] Limits of communicator size and number of parallel broadcast transmissions

2020-03-17 Thread George Bosilca via users
On Mon, Mar 16, 2020 at 6:15 PM Konstantinos Konstantinidis via users <
users@lists.open-mpi.org> wrote:

> Hi, I have some questions regarding technical details of MPI collective
> communication methods and broadcast:
>
>- I want to understand when the number of receivers in a MPI_Bcast can
>be a problem slowing down the broadcast.
>
> This is a pretty strong claim. Do you mind sharing with us the data that
allowed you to reach such a conclusion?

>
>- There are a few implementations of MPI_Bcast. Consider that of a
>binary tree. In this case, the sender (root) transmits the common
>message to its two children and each them to two more and so on. Is it
>accurate to say that in each level of the tree all transmissions happen in
>parallel or only one transmission can be done from each node?
>
> Neither of those. Assuming the 2 children are both on different nodes, one
might not want to split the outgoing bandwidth of the parent between the 2
children, but instead order the two sends. In this case, one of the children will be
serviced first, while the other one is waiting, and then in a binary tree,
the second is serviced while the first is waiting. So, the binary tree
communication pattern maximizes the outgoing bandwidth of some of the nodes,
but not the overall bisectional bandwidth of the machine.

This is not necessarily true for hardware-level collectives. If the
switches propagate the information they might be able to push the data out
to multiple other switches, more or less in parallel.
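
To make the ordering argument concrete, here is a sketch of a generic
binomial-tree broadcast built on point-to-point calls (this is not Open MPI's
actual implementation); note that every non-leaf rank issues its sends one
after the other, which is the serialization described above:

#include <mpi.h>

static void tree_bcast(void *buf, int count, MPI_Datatype dt, int root, MPI_Comm comm)
{
    int rank, size, mask = 1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int vrank = (rank - root + size) % size;            /* root becomes virtual rank 0 */

    while (mask < size) {                               /* receive once, from the parent */
        if (vrank & mask) {
            MPI_Recv(buf, count, dt, (rank - mask + size) % size, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }
    for (mask >>= 1; mask > 0; mask >>= 1)              /* then forward to children, one by one */
        if (vrank + mask < size)
            MPI_Send(buf, count, dt, (rank + mask) % size, 0, comm);
}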


>
>- To that end, is there a limit on the number of processes a process
>can broadcast to in parallel?
>
> No? I am not sure I understand the context of this question. I would say
that as long as the collective communication is implemented over
point-to-point communications, the limit is 1.

>
>- Since each MPI_Bcast is associated with a communicator is there a
>limit on the number of processes a communicator can have and if so what is
>it in Open MPI?
>
> No, there is no such limit in MPI, or in Open MPI.

  George.



>
> Regards,
> Kostas
>


Re: [OMPI users] Fault in not recycling bsend buffer ?

2020-03-17 Thread George Bosilca via users
Martyn,

I don't know exactly what your code is doing, but based on your inquiry I
assume you are using MPI_BSEND multiple times and you run out of local
buffers.

The MPI standard does not mandate a wait until buffer space becomes
available, because that can lead to deadlocks (communication pattern
depends on a local receive that will be posted after the bsend loop).
Instead, the MPI standard states it is the user's responsibility to ensure
enough buffer is available before calling MPI_BSEND, MPI3.2 page 39 line
36, "then MPI must buffer the outgoing message, so as to allow the send to
complete. An error will occur if there is insufficient buffer space". For
blocking buffered sends this is a gray area because from a user perspective
it is difficult to know when you can safely reuse the buffer without
implementing some kind of feedback mechanism to confirm the reception. For
nonblocking the constraint is relaxed as indicated on page 55 line 33,
"Successful return of MPI_WAIT after a MPI_IBSEND implies that the user
buffer can be reused".

In short, you should always make sure you have enough available buffer
space for your buffered sends to be able to locally pack the data to be
sent, or be ready to deal with the error returned by MPI (this part would
not be portable across different MPI implementations).

  George.
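
A minimal sketch of the sizing this implies: attach a buffer large enough for
the maximum number of buffered sends that can be outstanding at once (the
count, datatype and n_outstanding are placeholders):

int pack_size, buf_size;
char *bsend_buf;

MPI_Pack_size(count, MPI_DOUBLE, comm, &pack_size);
buf_size  = n_outstanding * (pack_size + MPI_BSEND_OVERHEAD);
bsend_buf = malloc(buf_size);
MPI_Buffer_attach(bsend_buf, buf_size);

/* ... at most n_outstanding MPI_Bsend/MPI_Ibsend in flight at any time ... */

MPI_Buffer_detach(&bsend_buf, &buf_size);   /* blocks until all buffered sends complete */
free(bsend_buf);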




On Tue, Mar 17, 2020 at 7:59 AM Martyn Foster via users <
users@lists.open-mpi.org> wrote:

> Hi all,
>
> I'm new here, so please be gentle :-)
>
> Versions: OpenMPI 4.0.3rc1, UCX 1.7
>
> I have a hang in an application (OK for small data sets, but fails with a
> larger one). The error is
>
> "bsend: failed to allocate buffer"
>
> This comes from
>
> pml_ucx.c:693
> mca_pml_ucx_bsend( ... )
> ...
> packed_data = mca_pml_base_bsend_request_alloc_buf(packed_length);
> if (OPAL_UNLIKELY(NULL == packed_data)) {
> OBJ_DESTRUCT(&opal_conv);
> PML_UCX_ERROR( "bsend: failed to allocate buffer");
> return UCS_STATUS_PTR(OMPI_ERROR);
> }
>
> In fact the request appears to be 1.3MB and the bsend buffer is (should
> be!) 128MB
>
> In pml_base_bsend:332
> void*  mca_pml_base_bsend_request_alloc_buf( size_t length )
> {
>void* buf = NULL;
> /* has a buffer been provided */
> OPAL_THREAD_LOCK(&mca_pml_bsend_mutex);
> if(NULL == mca_pml_bsend_addr) {
> OPAL_THREAD_UNLOCK(&mca_pml_bsend_mutex);
> return NULL;
> }
>
> /* allocate a buffer to hold packed message */
> buf = mca_pml_bsend_allocator->alc_alloc(
> mca_pml_bsend_allocator, length, 0);
> if(NULL == buf) {
> /* release resources when request is freed */
> OPAL_THREAD_UNLOCK(&mca_pml_bsend_mutex);
> /* progress communications, with the hope that more resources
>  *   will be freed */
> opal_progress();
> return NULL;
> }
>
> /* increment count of pending requests */
> mca_pml_bsend_count++;
> OPAL_THREAD_UNLOCK(&mca_pml_bsend_mutex);
>
> return buf;
> }
>
> It seems that there is a strong hint here that we can wait for the bsend
> buffer to become available, and yet mca_pml_ucx_bsend doesn't have a retry
> mechanism and just fails on the first attempt. A simple hack to turn the 
> "if(NULL
> == buf) {" into a "while(NULL == buf) {"
> in mca_pml_base_bsend_request_alloc_buf seems to support this (the
> application proceeds after a few milliseconds)...
>
> Is this hypothesis correct?
>
> Best regards, Martyn
>
>
>


Re: [OMPI users] Trouble with Mellanox's hcoll component and MPI_THREAD_MULTIPLE support?

2020-02-04 Thread George Bosilca via users
Hcoll will be present in many cases; you don't really want to skip them
all. I foresee 2 problems with the approach you propose:
- collective components are selected per communicator, so even if they will
not be used they are still loaded.
- from outside the MPI library you have little access to internal
information, especially to the components that are loaded and active.

I'm afraid the best solution is to prevent OMPI from loading the hcoll
component if you want to use threading, by adding '--mca coll ^hcoll' to
your mpirun command line.

  George.
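
If the decision has to be taken inside the application rather than on the
mpirun command line, one possible workaround (a sketch that relies on Open
MPI reading MCA parameters from OMPI_MCA_* environment variables at
initialization time) is to set the parameter before MPI_Init:

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;
    /* Disable the hcoll collective component for this run; the condition
     * under which you do this is application-specific. */
    setenv("OMPI_MCA_coll", "^hcoll", 1);
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    /* ... */
    MPI_Finalize();
    return 0;
}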


On Tue, Feb 4, 2020 at 8:32 AM Angel de Vicente 
wrote:

> Hi,
>
> George Bosilca  writes:
>
> > If I'm not mistaken, hcoll is playing with the opal_progress in a way
> > that conflicts with the blessed usage of progress in OMPI and prevents
> > other components from advancing and timely completing requests. The
> > impact is minimal for sequential applications using only blocking
> > calls, but is jeopardizing performance when multiple types of
> > communications are simultaneously executing or when multiple threads
> > are active.
> >
> > The solution might be very simple: hcoll is a module providing support
> > for collective communications so as long as you don't use collectives,
> > or the tuned module provides collective performance similar to hcoll
> > on your cluster, just go ahead and disable hcoll. You can also reach
> > out to Mellanox folks asking them to fix the hcoll usage of
> > opal_progress.
>
> Until we find a more robust solution I was thinking of trying to just
> query the MPI implementation at run time and use the threaded
> version if hcoll is not present, and go for the unthreaded version if it
> is. Looking at the coll.h file I see that some functions there might be
> useful, for example mca_coll_base_component_comm_query_2_0_0_fn_t, but
> I have never delved into this. Would this be an appropriate approach? Any
> examples of how to query in code for a particular component?
>
> Thanks,
> --
> Ángel de Vicente
>
> Tel.: +34 922 605 747
> Web.: http://research.iac.es/proyecto/polmag/
>
> -
> ADVERTENCIA: Sobre la privacidad y cumplimiento de la Ley de Protección de
> Datos, acceda a http://www.iac.es/disclaimer.php
> WARNING: For more information on privacy and fulfilment of the Law
> concerning the Protection of Data, consult
> http://www.iac.es/disclaimer.php?lang=en
>
>


Re: [OMPI users] Trouble with Mellanox's hcoll component and MPI_THREAD_MULTIPLE support?

2020-02-03 Thread George Bosilca via users
If I'm not mistaken, hcoll is playing with the opal_progress in a way that
conflicts with the blessed usage of progress in OMPI and prevents other
components from advancing and timely completing requests. The impact is
minimal for sequential applications using only blocking calls, but is
jeopardizing performance when multiple types of communications are
simultaneously executing or when multiple threads are active.

The solution might be very simple: hcoll is a module providing support for
collective communications so as long as you don't use collectives, or the
tuned module provides collective performance similar to hcoll on your
cluster, just go ahead and disable hcoll. You can also reach out to
Mellanox folks asking them to fix the hcoll usage of opal_progress.

  George.


On Mon, Feb 3, 2020 at 11:09 AM Angel de Vicente via users <
users@lists.open-mpi.org> wrote:

> Hi,
>
> in one of our codes, we want to create a log of events that happen in
> the MPI processes, where the number of these events and their timing is
> unpredictable.
>
> So I implemented a simple test code, where process 0
> creates a thread that is just busy-waiting for messages from any
> process, and which are sent to stdout/stderr/a log file upon receiving
> them. The test code is at https://github.com/angel-devicente/thread_io
> and the same idea went into our "real" code.
>
> As far as I could see, this behaves very nicely, there are no deadlocks,
> no lost messages and the performance penalty is minimal when considering
> the real application this is intended for.
>
> But then I found that in a local cluster the performance was very bad
> (from ~5min 50s to ~5s for some test) when run with the locally
> installed OpenMPI and my own OpenMPI installation (same gcc and OpenMPI
> versions). Checking the OpenMPI configuration details, I found that the
> locally installed OpenMPI was configured to use the Mellanox IB driver,
> and in particular the hcoll component was somehow killing performance:
>
> running with
>
> mpirun  --mca coll_hcoll_enable 0 -np 51 ./test_t
>
> was taking ~5s, while enabling coll_hcoll was killing performance, as
> stated above (when run in a single node the performance also goes down,
> but only about a factor 2X).
>
> Has anyone seen anything like this? Perhaps a newer Mellanox driver
> would solve the problem?
>
> We were planning on making our code public, but before we do so, I want
> to understand under which conditions we could have this problem with the
> "Threaded I/O" approach and if possible how to get rid of it completely.
>
> Any help/pointers appreciated.
> --
> Ángel de Vicente
>
> Tel.: +34 922 605 747
> Web.: http://research.iac.es/proyecto/polmag/
>
> -
> ADVERTENCIA: Sobre la privacidad y cumplimiento de la Ley de Protección de
> Datos, acceda a http://www.iac.es/disclaimer.php
> WARNING: For more information on privacy and fulfilment of the Law
> concerning the Protection of Data, consult
> http://www.iac.es/disclaimer.php?lang=en
>
>


Re: [OMPI users] HELP: openmpi is not using the specified infiniband interface !!

2020-01-14 Thread George Bosilca via users
According to the error message you are using MPICH not Open MPI.

  George.


On Tue, Jan 14, 2020 at 5:53 PM SOPORTE MODEMAT via users <
users@lists.open-mpi.org> wrote:

> Hello everyone.
>
>
>
> I would like somebody to help me figure out how I can make
> OpenMPI use the InfiniBand interface that I specify with the command:
>
>
>
> /opt/mpi/openmpi_intel-2.1.1/bin/mpirun --mca btl self,openib python
> mpi_hola.py
>
>
>
> But when I print the hostname and the IP address of the interface from the
> Python script, it shows the Ethernet interface:
>
>
>
> # Ejemplo de mpi4py
>
> # Funcion 'Hola mundo'
>
>
>
> from mpi4py import MPI
>
> import socket
>
>
>
> ##print(socket.gethostname())
>
>
>
> comm = MPI.COMM_WORLD
>
> rank = comm.Get_rank()
>
> size = comm.Get_size()
>
>
>
> print('Hola mundo')
>
> print('Proceso {} de {}'.format(rank, size))
>
>
>
> host_name = socket.gethostname()
>
> host_ip = socket.gethostbyname(host_name)
>
> print("Hostname :  ",host_name)
>
> print("IP : ",host_ip)
>
>
>
> The output is:
>
>
>
>
>
> Hola mundo
>
> Proceso 0 de 1
>
> Hostname :   apollo-2.private
>
> IP :  10.50.1.253
>
> Hostname :   apollo-2.private
>
> IP :  10.50.1.253
>
> Hostname :   apollo-2.private
>
> IP :  10.50.1.253
>
> Hostname :   apollo-2.private
>
> IP :  10.50.1.253
>
>
>
> But the node already has the InfiniBand interface configured on ib0 with
> another network.
>
>
>
> I would like you to give me some advice to make this script use the
> infiniband interface that is specified via mpi.
>
>
>
> When I run:
>
>
>
> mpirun --mca btl self,openib ./a.out
>
>
>
>
>
> I get this error that confirms that MPI is using the Ethernet interface:
>
>
>
> [mpiexec@apollo-2.private] match_arg (./utils/args/args.c:122):
> unrecognized argument mca
>
> [mpiexec@apollo-2.private] HYDU_parse_array (./utils/args/args.c:140):
> argument matching returned error
>
> [mpiexec@apollo-2.private] parse_args (./ui/mpich/utils.c:1387): error
> parsing input array
>
> [mpiexec@apollo-2.private] HYD_uii_mpx_get_parameters
> (./ui/mpich/utils.c:1438): unable to parse user arguments
>
>
>
> Usage: ./mpiexec [global opts] [exec1 local opts] : [exec2 local opts] :
> ...
>
>
>
> Global options (passed to all executables):
>
>
>
>
>
> Additional Information:
>
>
>
> ll /sys/class/infiniband: mlx5_0
>
>
>
> /sys/class/net/
>
> total 0
>
> lrwxrwxrwx 1 root root 0 Jan 14 12:03 eno49 ->
> ../../devices/pci:00/:00:02.0/:07:00.0/net/eno49
>
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 eno50 ->
> ../../devices/pci:00/:00:02.0/:07:00.1/net/eno50
>
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 eno51 ->
> ../../devices/pci:00/:00:02.0/:07:00.2/net/eno51
>
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 eno52 ->
> ../../devices/pci:00/:00:02.0/:07:00.3/net/eno52
>
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 ib0 ->
> ../../devices/pci:00/:00:01.0/:06:00.0/net/ib0
>
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 ib1 ->
> ../../devices/pci:00/:00:01.0/:06:00.0/net/ib1
>
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 lo -> ../../devices/virtual/net/lo
>
>
>
>
>
> The operating system is Linux Centos 7.
>
>
>
>
>
> Thank you in advance for your help.
>
>
>
> Kind regards.
>
>
>
> Soporte.modemat
>


Re: [OMPI users] Non-blocking send issue

2020-01-02 Thread George Bosilca via users
This is going back to the fact that you, as a developer, are the best
placed to know exactly when asynchronous progress is needed for your
algorithm, so from that perspective you can provide that progress in the
most timely manner. One way to force MPI to do progress, is to spawn
another thread (using pthread_create as an example), dedicated to providing
a means to the MPI library to execute progress. This communication thread
can use a non-blocking ANY_SOURCE receive, using a tag that will never
match any other message. This way, you can safely cancel this pending
request upon completion of your app. You can then rely on this thread to
call MPI_Test when you know you need guaranteed progress, and the rest of
the time you can park it on some synchronization primitive
(mutex/condition).

  George.
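
A rough sketch of that helper-thread idea (it requires initializing MPI with
MPI_THREAD_MULTIPLE; the never-matching tag and the polling interval are
arbitrary choices):

#include <mpi.h>
#include <pthread.h>
#include <unistd.h>

#define NEVER_MATCHED_TAG 123456          /* assumption: no real message uses this tag */

static volatile int shutting_down = 0;    /* set by the main thread at the end of the run */

static void *progress_thread(void *arg)
{
    MPI_Comm comm = *(MPI_Comm *)arg;
    MPI_Request dummy;
    int buf, flag;

    /* A receive that never completes: it only gives the thread a request to
     * poll, and each MPI_Test call lets the MPI library advance its progress engine. */
    MPI_Irecv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, NEVER_MATCHED_TAG, comm, &dummy);
    while (!shutting_down) {
        MPI_Test(&dummy, &flag, MPI_STATUS_IGNORE);
        usleep(100);                      /* or park on a mutex/condition variable */
    }
    MPI_Cancel(&dummy);                   /* safe: the request can never have matched */
    MPI_Wait(&dummy, MPI_STATUS_IGNORE);
    return NULL;
}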


On Tue, Dec 31, 2019 at 5:12 PM Martín Morales <
martineduardomora...@hotmail.com> wrote:

> Hi George, thank you very much for your answer. Can you please explain to me
> a little more about "If you need to guarantee progress you might either
> have your own thread calling MPI functions (such as MPI_Test)". Regards
>
> Martín
>
> ----------
> *From:* George Bosilca 
> *Sent:* Tuesday, December 31, 2019 13:47
> *To:* Open MPI Users 
> *Cc:* Martín Morales 
> *Subject:* Re: [OMPI users] Non-blocking send issue
>
> Martin,
>
> The MPI standard does not mandate progress outside MPI calls, thus
> implementations are free to provide, or not, asynchronous progress. Calling
> MPI_Test provides the MPI implementation with an opportunity to progress
> it's internal communication queues. However, an implementation could try a
> best effort to limit the time it spent in MPI_Test* and to provide the
> application with more time for computation, even when this might limit its
> own internal progress. Thus, as a non-blocking collective is composed of a
> potentially large number of point-to-point communications, it might require
> a significant number of MPI_Test to reach completion.
>
> If you need to guarantee progress you might either have your own thread
> calling MPI functions (such as MPI_Test) or you can use the asynchronous
> progress some MPI libraries provide. For this last option read the
> documentation of your MPI implementation to see how to enable asynchronous
> progress.
>
>   George.
>
>
> On Mon, Dec 30, 2019 at 2:31 PM Martín Morales via users <
> users@lists.open-mpi.org> wrote:
>
> Hello all!
> I'm on OMPI 4.0.1 and I have a strange (or at least unexpected) behaviour
> with some non-blocking sending calls: MPI_Isend and MPI_Ibcast.
> I really need asynchronous sending so I don't use MPI_Wait after the send
> call (MPI_Isend or MPI_Ibcast); instead of this I check "on demand" with
> MPI_Test to verify whether the send is complete or not. The test I'm doing
> sends just an int value. Here is some code (with MPI_Ibcast):
>
> ***SENDER***
>
> // Note that it uses an intercommunicator
> MPI_Ibcast(&some_int_data, 1, MPI_INT, MPI_ROOT, mpi_intercomm,
> &request_sender);
> //MPI_Wait(&request_sender, MPI_STATUS_IGNORE); <-- I don't want this
>
>
> ***RECEIVER***
>
> MPI_Ibcast(&some_int_data, 1, MPI_INT, 0, parentcomm,
> &request_receiver);
> MPI_Wait(&request_receiver, MPI_STATUS_IGNORE);
>
> ***TEST RECEPTION (same sender instance program)***
>
> void test_reception() {
>
> int request_complete;
>
> MPI_Test(&request_sender, &request_complete, MPI_STATUS_IGNORE);
>
> if (request_complete) {
> ...
> } else {
> ...
> }
>
> }
>
> But when I invoke this test function after some time has elapsed since I
> sent, the request isn't complete and I have to invoke this test function
> again and again... a variable number of times, until it finally completes. It's
> just an int that was sent, nothing more (all on a local machine); such a delay
> makes no sense. The request should be completed on the first test
> invocation.
>
> If, instead of this, I uncomment the unwanted MPI_Wait (i.e. treating it like
> a synchronous request), it completes immediately, as expected.
> If I send with MPI_Isend I get the same behaviour.
>
> I don't understand what is going on. Any help will be very appreciated.
>
> Regards.
>
> Martín
>
>


Re: [OMPI users] Non-blocking send issue

2019-12-31 Thread George Bosilca via users
Martin,

The MPI standard does not mandate progress outside MPI calls, thus
implementations are free to provide, or not, asynchronous progress. Calling
MPI_Test provides the MPI implementation with an opportunity to progress
its internal communication queues. However, an implementation could try a
best effort to limit the time it spent in MPI_Test* and to provide the
application with more time for computation, even when this might limit its
own internal progress. Thus, as a non-blocking collective is composed of a
potentially large number of point-to-point communications, it might require
a significant number of MPI_Test to reach completion.

If you need to guarantee progress you might either have your own thread
calling MPI functions (such as MPI_Test) or you can use the asynchronous
progress some MPI libraries provide. For this last option read the
documentation of your MPI implementation to see how to enable asynchronous
progress.

  George.


On Mon, Dec 30, 2019 at 2:31 PM Martín Morales via users <
users@lists.open-mpi.org> wrote:

> Hello all!
> I'm on OMPI 4.0.1 and I have a strange (or at least unexpected) behaviour
> with some non-blocking sending calls: MPI_Isend and MPI_Ibcast.
> I really need asynchronous sending so I don't use MPI_Wait after the send
> call (MPI_Isend or MPI_Ibcast); instead of this I check "on demand" with
> MPI_Test to verify whether the send is complete or not. The test I'm doing
> sends just an int value. Here is some code (with MPI_Ibcast):
>
> ***SENDER***
>
> // Note that it uses an intercommunicator
> MPI_Ibcast(&some_int_data, 1, MPI_INT, MPI_ROOT, mpi_intercomm,
> &request_sender);
> //MPI_Wait(&request_sender, MPI_STATUS_IGNORE); <-- I don't want this
>
>
> ***RECEIVER***
>
> MPI_Ibcast(&some_int_data, 1, MPI_INT, 0, parentcomm,
> &request_receiver);
> MPI_Wait(&request_receiver, MPI_STATUS_IGNORE);
>
> ***TEST RECEPTION (same sender instance program)***
>
> void test_reception() {
>
> int request_complete;
>
> MPI_Test(&request_sender, &request_complete, MPI_STATUS_IGNORE);
>
> if (request_complete) {
> ...
> } else {
> ...
> }
>
> }
>
> But when I invoke this test function after some time has elapsed since I
> sent, the request isn't complete and I have to invoke this test function
> again and again... a variable number of times, until it finally completes. It's
> just an int that was sent, nothing more (all on a local machine); such a delay
> makes no sense. The request should be completed on the first test
> invocation.
>
> If, instead of this, I uncomment the unwanted MPI_Wait (i.e. treating it like
> a synchronous request), it completes immediately, as expected.
> If I send with MPI_Isend I get the same behaviour.
>
> I don't understand what is going on. Any help will be very appreciated.
>
> Regards.
>
> Martín
>


Re: [OMPI users] CUDA mpi question

2019-11-28 Thread George Bosilca via users
Wonderful maybe but extremely unportable. Thanks but no thanks!

  George.

On Wed, Nov 27, 2019 at 11:07 PM Zhang, Junchao  wrote:

> Interesting idea. But doing MPI_THREAD_MULTIPLE has other side-effects. If
> MPI nonblocking calls could take an extra stream argument and work like a
> kernel launch, it would be wonderful.
> --Junchao Zhang
>
>
> On Wed, Nov 27, 2019 at 6:12 PM Joshua Ladd  wrote:
>
>> Why not spawn num_threads, where num_threads is the number of Kernels to
>> launch , and compile with the “--default-stream per-thread” option?
>>
>>
>>
>> Then you could use MPI in thread multiple mode to achieve your objective.
>>
>>
>>
>> Something like:
>>
>>
>>
>>
>>
>>
>>
>> void *launch_kernel(void *dummy)
>>
>> {
>>
>> float *data;
>>
>> cudaMalloc(&data, N * sizeof(float));
>>
>>
>>
>> kernel<<<...>>>(data, N);
>>
>>
>>
>> cudaStreamSynchronize(0);
>>
>>
>>
>> MPI_Isend(data,..);
>>
>> return NULL;
>>
>> }
>>
>>
>>
>> int main()
>>
>> {
>>
>> MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>>
>> const int num_threads = 8;
>>
>>
>>
>> pthread_t threads[num_threads];
>>
>>
>>
>> for (int i = 0; i < num_threads; i++) {
>>
>> if (pthread_create(&threads[i], NULL, launch_kernel, 0)) {
>>
>> fprintf(stderr, "Error creating threadn");
>>
>> return 1;
>>
>> }
>>
>> }
>>
>>
>>
>> for (int i = 0; i < num_threads; i++) {
>>
>> if(pthread_join(threads[i], NULL)) {
>>
>> fprintf(stderr, "Error joining threadn");
>>
>> return 2;
>>
>> }
>>
>> }
>>
>> cudaDeviceReset();
>>
>>
>>
>> MPI_Finalize();
>>
>> }
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *From:* users  *On Behalf Of *Zhang,
>> Junchao via users
>> *Sent:* Wednesday, November 27, 2019 5:43 PM
>> *To:* George Bosilca 
>> *Cc:* Zhang, Junchao ; Open MPI Users <
>> users@lists.open-mpi.org>
>> *Subject:* Re: [OMPI users] CUDA mpi question
>>
>>
>>
>> I was pointed to "2.7. Synchronization and Memory Ordering" of
>> https://docs.nvidia.com/pdf/GPUDirect_RDMA.pdf.
>> It is on topic. But unfortunately it is too short and I could not
>> understand it.
>>
>> I also checked cudaStreamAddCallback/cudaLaunchHostFunc, which say the
>> host function "must not make any CUDA API calls". I am not sure if
>> MPI_Isend qualifies as such functions.
>>
>> --Junchao Zhang
>>
>>
>>
>>
>>
>> On Wed, Nov 27, 2019 at 4:18 PM George Bosilca 
>> wrote:
>>
>> On Wed, Nov 27, 2019 at 5:02 PM Zhang, Junchao 
>> wrote:
>>
>> On Wed, Nov 27, 2019 at 3:16 PM George Bosilca 
>> wrote:
>>
>> Short and portable answer: you need to sync before the Isend or you will
>> send garbage data.
>>
>> Ideally, I want to formulate my code into a series of asynchronous
>> "kernel launch, kernel launch, ..." without synchronization, so that I can
>> hide kernel launch overhead. It now seems I have to sync before MPI calls
>> (even nonblocking calls)
>>
>>
>>
>> Then you need a means to ensure sequential execution, and this is what
>> the streams provide. Unfortunately, I looked into the code and I'm afraid
>> there is currently no realistic way to do what you need. My previous
>> comment was based on an older code, that seems to be 1)
>> unmaintained currently, and 2) only applicable to the OB1 PML + OpenIB BTL
>> combo. As recent versions of OMPI have moved away from the OpenIB BTL,
>> relying more heavily on UCX for Infiniband support, the old code is now
>> deprecated. Sorry for giving you hope on this.
>>
>>
>>
>> Maybe you can delegate the MPI call into a CUDA event callback ?
>>
>>
>>
>>   George.
>>
>>
>>
>>
>>
>>
>>

Re: [OMPI users] CUDA mpi question

2019-11-27 Thread George Bosilca via users
On Wed, Nov 27, 2019 at 5:02 PM Zhang, Junchao  wrote:

> On Wed, Nov 27, 2019 at 3:16 PM George Bosilca 
> wrote:
>
>> Short and portable answer: you need to sync before the Isend or you will
>> send garbage data.
>>
> Ideally, I want to formulate my code into a series of asynchronous "kernel
> launch, kernel launch, ..." without synchronization, so that I can hide
> kernel launch overhead. It now seems I have to sync before MPI calls (even
> nonblocking calls)
>

Then you need a means to ensure sequential execution, and this is what the
streams provide. Unfortunately, I looked into the code and I'm afraid there
is currently no realistic way to do what you need. My previous comment was
based on older code, which seems to be 1) currently unmaintained, and 2)
only applicable to the OB1 PML + OpenIB BTL combo. As recent versions of
OMPI have moved away from the OpenIB BTL, relying more heavily on UCX for
Infiniband support, the old code is now deprecated. Sorry for giving you
hope on this.

Maybe you can delegate the MPI call into a CUDA event callback ?

  George.



>
>
>>
>> Assuming you are willing to go for a less portable solution you can get
>> the OMPI streams and add your kernels inside, so that the sequential order
>> will guarantee correctness of your isend. We have 2 hidden CUDA streams in
>> OMPI, one for device-to-host and one for host-to-device, that can be
>> queried with the non-MPI standard compliant functions
>> (mca_common_cuda_get_dtoh_stream and mca_common_cuda_get_htod_stream).
>>
>> Which streams (dtoh or htod) should I use to insert kernels producing
> send data and kernels using received data? I imagine MPI uses GPUDirect
> RDMA to move data directly from GPU to NIC. Why do we need to bother dtoh
> or htod streams?
>

>
>> George.
>>
>>
>> On Wed, Nov 27, 2019 at 4:02 PM Zhang, Junchao via users <
>> users@lists.open-mpi.org> wrote:
>>
>>> Hi,
>>>   Suppose I have this piece of code and I use cuda-aware MPI,
>>>   cudaMalloc(&sbuf, sz);
>>>
>>>Kernel1<<<...,stream>>>(...,sbuf);
>>>MPI_Isend(sbuf,...);
>>>Kernel2<<<...,stream>>>();
>>>
>>>
>>>   Do I need to call cudaStreamSynchronize(stream) before MPI_Isend() to
>>> make sure data in sbuf is ready to send?  If not, why?
>>>
>>>   Thank you.
>>>
>>> --Junchao Zhang
>>>
>>


Re: [OMPI users] CUDA mpi question

2019-11-27 Thread George Bosilca via users
Short and portable answer: you need to sync before the Isend or you will
send garbage data.

Assuming you are willing to go for a less portable solution you can get the
OMPI streams and add your kernels inside, so that the sequential order will
guarantee correctness of your isend. We have 2 hidden CUDA streams in OMPI,
one for device-to-host and one for host-to-device, that can be queried with
the non-MPI standard compliant functions (mca_common_cuda_get_dtoh_stream
and mca_common_cuda_get_htod_stream).

George.
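
[Editor's note: for completeness, a minimal sketch of the short-and-portable
route applied to the quoted example below; N, dest, grid, block and the two
kernels are placeholders from the original question.]

   double *sbuf;
   cudaStream_t stream;
   MPI_Request req;

   cudaStreamCreate(&stream);
   cudaMalloc((void **)&sbuf, N * sizeof(double));

   Kernel1<<<grid, block, 0, stream>>>(sbuf);   /* produces sbuf on the device      */
   cudaStreamSynchronize(stream);               /* sbuf is complete past this point */
   MPI_Isend(sbuf, N, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);
   Kernel2<<<grid, block, 0, stream>>>();       /* must not modify sbuf             */
   MPI_Wait(&req, MPI_STATUS_IGNORE);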


On Wed, Nov 27, 2019 at 4:02 PM Zhang, Junchao via users <
users@lists.open-mpi.org> wrote:

> Hi,
>   Suppose I have this piece of code and I use cuda-aware MPI,
>   cudaMalloc(&sbuf, sz);
>
>Kernel1<<<...,stream>>>(...,sbuf);
>MPI_Isend(sbuf,...);
>Kernel2<<<...,stream>>>();
>
>
>   Do I need to call cudaStreamSynchronize(stream) before MPI_Isend() to
> make sure data in sbuf is ready to send?  If not, why?
>
>   Thank you.
>
> --Junchao Zhang
>


Re: [OMPI users] Program hangs when MPI_Bcast is called rapidly

2019-10-29 Thread George Bosilca via users
Charles,

Having implemented some of the underlying collective algorithms, I am
puzzled by the need to force the sync to 1 to have things flowing. I would
definitely appreciate a reproducer so that I can identify (and hopefully
fix) the underlying problem.

Thanks,
  George.


On Tue, Oct 29, 2019 at 2:20 PM Garrett, Charles via users <
users@lists.open-mpi.org> wrote:

> Last time I did a reply on here, it created a new thread.  Sorry about
> that everyone.  I just hit the Reply via email button.  Hopefully this one
> will work.
>
>
>
> To Gilles Gouaillardet:
>
> My first thread has a reproducer that causes the problem.
>
>
>
> To Beorge Bosilca:
>
> I had to set coll_sync_barrier_before=1.  Even setting to 10 did not fix
> my problem.  I was surprised by this and I’m still surprised given your
> comment on setting to larger than a few tens.  Thanks for the explanation
> about the problem.
>
>
>
> Charles Garrett
>


Re: [OMPI users] Program hangs when MPI_Bcast is called rapidly

2019-10-29 Thread George Bosilca via users
Charles,

There is a known issue with calling collectives in a tight loop, due to the
lack of flow control at the network level. It results in a significant
slow-down that might appear as a deadlock to users. The workaround is to
enable the sync collective module, which will insert a fake barrier at
regular intervals in the tight collective loop, allowing a more streamlined
usage of the network.

Run `ompi_info --param coll sync -l 9` to see the options you need to play
with. I think setting one of the coll_sync_barrier_before
or coll_sync_barrier_after to anything larger than a few tens should be
good enough.

  George.
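
[Editor's note: for example (the values are only illustrative; check the
ompi_info output for the exact semantics of each parameter):]

   ompi_info --param coll sync -l 9                      # list the sync module options
   mpirun --mca coll_sync_barrier_before 100 ./my_app    # barrier every 100 collectives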


On Mon, Oct 28, 2019 at 9:29 PM Gilles Gouaillardet via users <
users@lists.open-mpi.org> wrote:

> Charles,
>
>
> unless you expect yes or no answers, can you please post a simple
> program that evidences
>
> the issue you are facing ?
>
>
> Cheers,
>
>
> Gilles
>
> On 10/29/2019 6:37 AM, Garrett, Charles via users wrote:
> >
> > Does anyone have any idea why this is happening?  Has anyone seen this
> > problem before?
> >
>


Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread George Bosilca via users
To completely disable UCX you need to disable the UCX MTL and not only the
BTL. I would use "--mca pml ob1 --mca btl ^ucx --mca btl_openib_allow_ib 1".

As you have a gdb session on the processes you can try to break on some of
the memory allocations function (malloc, realloc, calloc).

  George.
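
[Editor's note: for instance, from a gdb session already attached to one of
the processes, a conditional breakpoint can narrow this down; the $rdi
register holds the first argument only on x86-64, so adjust for other
architectures.]

   (gdb) break malloc if $rdi >= 1048576   # stop only on allocations of 1 MiB or more
   (gdb) continue
   (gdb) backtrace                         # shows who requested the memory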


On Wed, Jun 19, 2019 at 2:37 PM Noam Bernstein via users <
users@lists.open-mpi.org> wrote:

> I tried to disable ucx (successfully, I think - I replaced the “--mca btl
> ucx --mca btl ^vader,tcp,openib” with “--mca btl_openib_allow_ib 1”, and
> attaching gdb to a running process shows no ucx-related routines active).
> It still has the same fast growing (1 GB/s) memory usage problem.
>
>
>   Noam
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Can displs in Scatterv/Gatherv/etc be a GPU array for CUDA-aware MPI?

2019-06-11 Thread George Bosilca via users
Leo,

In a UMA system, having the displacements and/or recvcounts arrays in managed
GPU memory should work, but it will incur overheads for at least 2 reasons:
1. the MPI API arguments are checked for correctness (here recvcounts);
2. the collective algorithm part that executes on the CPU uses the
displacements and recvcounts to issue and manage communications, and it
therefore needs access to both.

Moreover, as you mention your code will not be portable anymore.

  George.
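
[Editor's note: in practice this means keeping counts and displacements in
host memory and putting only the data buffers on the device. A fragment-style
sketch, assuming nprocs, rank, d_send and a CUDA-aware Open MPI build:]

   int *counts = malloc(nprocs * sizeof(int));   /* host memory: read by the CPU part */
   int *displs = malloc(nprocs * sizeof(int));
   /* ... fill counts and displs on the host ... */

   double *d_recv;
   cudaMalloc((void **)&d_recv, counts[rank] * sizeof(double));

   MPI_Scatterv(d_send, counts, displs, MPI_DOUBLE,   /* data buffers may live on the GPU */
                d_recv, counts[rank], MPI_DOUBLE,
                0 /* root */, MPI_COMM_WORLD);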


On Tue, Jun 11, 2019 at 11:27 AM Fang, Leo via users <
users@lists.open-mpi.org> wrote:

> Hello,
>
>
> I understand that once Open MPI is built against CUDA, sendbuf/recvbuf can
> be pointers to GPU memory. I wonder whether or not the “displs" argument of
> the collective calls on variable data (Scatterv/Gatherv/etc) can also live
> on GPU. CUDA awareness isn’t part of the MPI standard (yet), so I suppose
> it’s worth asking or even documenting.
>
> Thank you.
>
>
> Sincerely,
> Leo
>
> ---
> Yao-Lung Leo Fang
> Assistant Computational Scientist
> Computational Science Initiative
> Brookhaven National Laboratory
> Bldg. 725, Room 2-169
> P.O. Box 5000, Upton, NY 11973-5000
> Office: (631) 344-3265
> Email: leof...@bnl.gov
> Website: https://leofang.github.io/
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Open questions on MPI_Allreduce background implementation

2019-06-08 Thread George Bosilca via users
There is an ongoing discussion about this on issue #4067 (
https://github.com/open-mpi/ompi/issues/4067). Also the mailing list
contains few examples on how to tweak the collective algorithms to your
needs.

  George.
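
[Editor's note: as an illustration of the kind of tuning discussed there, the
tuned collective component can be forced to a specific algorithm from the
command line; the algorithm numbering below (4 = ring in recent releases)
should be double-checked against the ompi_info output of the installed
version.]

   ompi_info --param coll tuned -l 9        # lists the algorithms and their numbers
   mpirun --mca coll_tuned_use_dynamic_rules 1 \
          --mca coll_tuned_allreduce_algorithm 4 ./my_app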


On Thu, Jun 6, 2019 at 7:42 PM hash join via users 
wrote:

> Hi all,
>
>
> I want to use the MPI_Allreduce to synchronize results from all the
> processes:
> https://github.com/open-mpi/ompi/blob/master/ompi/mpi/c/allreduce.c
>
> But I want to know the background algorithm to implement Allreduce. When I
> check the coll base of MPI_Allreduce:
> https://github.com/open-mpi/ompi/blob/2ae3cfd9bc9aa8cab80986a1921fd7ad9d198d07/ompi/mca/coll/base/coll_base_allreduce.c
>
> There are implementations like:
> ompi_coll_base_allreduce_intra_nonoverlapping,
> ompi_coll_base_allreduce_intra_recursivedoubling,
> ompi_coll_base_allreduce_intra_ring..etc. So how can I know which exact
> implement I am using when I use MPI_Allreduce. Or can I configure these?
> For example, I only want to use ring allreduce. How should I configure it?
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] OMPI 4.0.1 valgrind error on simple MPI_Send()

2019-04-30 Thread George Bosilca via users
Depending on the alignment of the different types there might be small
holes in the low-level headers we exchange between processes. It should not
be a concern for users.

valgrind should not stop on the first detected issue except
if --exit-on-first-error has been provided (the default value should be
no), so the SIGTERM might be generated for some other reason. What is
at jackhmmer.c:1597 ?

  George.
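
[Editor's note: if the immediate goal is just to let valgrind keep going past
this report, valgrind can emit ready-made suppressions with
--gen-suppressions=all, which can then be pasted into a local file passed via
an additional --suppressions= option. A hand-written sketch matching the trace
below would look roughly like this; the frame names may need adjusting to the
exact output:]

   {
      ompi_btl_tcp_sendto_padding
      Memcheck:Param
      socketcall.sendto(msg)
      fun:send*
      fun:mca_btl_tcp_send_blocking
      ...
   }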


On Tue, Apr 30, 2019 at 2:27 PM David Mathog via users <
users@lists.open-mpi.org> wrote:

> Attempting to debug a complex program (99.9% of which is others' code)
> which stops running when run in valgrind as follows:
>
> mpirun -np 10 \
>--hostfile /usr/common/etc/openmpi.machines.LINUX_INTEL_newsaf_rev2 \
>--mca plm_rsh_agent rsh \
>/usr/bin/valgrind \
>  --leak-check=full \
>  --leak-resolution=high \
>  --show-reachable=yes \
>  --log-file=nc.vg.%p \
>  --suppressions=/opt/ompi401/share/openmpi/openmpi-valgrind.supp \
> /usr/common/tmp/jackhmmer  \
>--tformat ncbi \
>-T 150  \
>--chkhmm jackhmmer_test \
>--mpi \
>~safrun/a1hu.pfa \
>/usr/common/tmp/testing/nr_lcl \
>>jackhmmer_test_mpi.out 2>jackhmmer_test_mpi.stderr &
>
> Every one of the nodes has a variant of this in the log file (followed
> by a long list
> of memory allocation errors, since it exits without being able to clean
> anything up):
>
> ==5135== Memcheck, a memory error detector
> ==5135== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
> ==5135== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright
> info
> ==5135== Command: /usr/common/tmp/jackhmmer --tformat ncbi -T 150
> --chkhmm jackhmmer_test --mpi /ulhhmi/safrun
> /a1hu.pfa /usr/common/tmp/testing/nr_lcl
> ==5135== Parent PID: 5119
> ==5135==
> ==5135== Syscall param socketcall.sendto(msg) points to uninitialised
> byte(s)
> ==5135==at 0x5459BFB: send (in /usr/lib64/libpthread-2.17.so)
> ==5135==by 0xF84A282: mca_btl_tcp_send_blocking (in
> /opt/ompi401/lib/openmpi/mca_btl_tcp.so)
> ==5135==by 0xF84E414: mca_btl_tcp_endpoint_send_handler (in
> /opt/ompi401/lib/openmpi/mca_btl_tcp.so)
> ==5135==by 0x5D6E4EF: event_persist_closure (event.c:1321)
> ==5135==by 0x5D6E4EF: event_process_active_single_queue
> (event.c:1365)
> ==5135==by 0x5D6E4EF: event_process_active (event.c:1440)
> ==5135==by 0x5D6E4EF: opal_libevent2022_event_base_loop
> (event.c:1644)
> ==5135==by 0x5D2465F: opal_progress (in
> /opt/ompi401/lib/libopen-pal.so.40.20.1)
> ==5135==by 0xF36A9CC: ompi_request_wait_completion (in
> /opt/ompi401/lib/openmpi/mca_pml_ob1.so)
> ==5135==by 0xF36C30E: mca_pml_ob1_send (in
> /opt/ompi401/lib/openmpi/mca_pml_ob1.so)
> ==5135==by 0x51BC581: PMPI_Send (in
> /opt/ompi401/lib/libmpi.so.40.20.1)
> ==5135==by 0x40B46E: mpi_worker (jackhmmer.c:1560)
> ==5135==by 0x406726: main (jackhmmer.c:413)
> ==5135==  Address 0x1ffefff8d5 is on thread 1's stack
> ==5135==  in frame #2, created by mca_btl_tcp_endpoint_send_handler
> (???:)
> ==5135==
> ==5135==
> ==5135== Process terminating with default action of signal 15 (SIGTERM)
> ==5135==at 0x5459EFD: ??? (in /usr/lib64/libpthread-2.17.so)
> ==5135==by 0x408817: mpi_failure (jackhmmer.c:887)
> ==5135==by 0x40B708: mpi_worker (jackhmmer.c:1597)
> ==5135==by 0x406726: main (jackhmmer.c:413)
>
> jackhmmer line 1560 is just this:
>
>
>  MPI_Send(&status, 1, MPI_INT, 0, HMMER_SETUP_READY_TAG,
> MPI_COMM_WORLD);
>
> preceded at varying distances by:
>
>int  status   = eslOK;
>status = 0;
>
> I can see why MPI might have some uninitialized bytes in that send, for
> instance, if it has a minimum buffer size it will send or something like
> that.  The problem is that it completely breaks valgrind in this
> application because valgrind exits immediately when it sees this error.
> The suppression file supplied with the release does not prevent that.
>
> How do I work around this?
>
> Thank you,
>
> David Mathog
> mat...@caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] 3.0.4, 4.0.1 build failure on OSX Mojave with LLVM

2019-04-24 Thread George Bosilca via users
Jon,

The configure AC_HEADER_STDC macro is considered obsolete [1] as most of
the OSes are STDC compliant nowadays. To have it failing on a recent
version of OSX is therefore unexpected. Moreover, many of the
OMPI developers work on OSX Mojave with the default compiler but with the
same system headers as you, so I expect the problem is not coming from the
OS but from how your compiler has been installed. To support this assumption,
I can successfully compile master and 4.0 on my laptop (Mojave) with the
default compiler, the MacPorts clang 7.0 package, the MacPorts gcc 7 and 8,
and icc 19 20180804.

I would start investigating with the most basic config.log you can provide
(no STDC related flags), to see if we can figure out why autoconf fails to
detect the STDC compliance.

  George.

[1]
https://www.gnu.org/software/autoconf/manual/autoconf-2.67/html_node/Particular-Headers.html

On Wed, Apr 24, 2019 at 10:07 AM JR Cary via users 
wrote:

>
>
> On 4/24/19 7:25 AM, Gilles Gouaillardet wrote:
> > John,
> >
> > what if you move some parameters to CPPFLAGS and CXXCPPFLAGS (see the
> > new configure command line below)
>
> Thanks, Gilles, that did in fact work for 3.3.4.  I was writing up an email
> when yours came in, so I'll move my response to here:
>
> Jeff, I've now tried several things, so I have some ideas, and I can
> send you one
> or more config.log's, as you wish.  Will mention options at end of this.
>
> First, I got the compile to work, below, with -DOPAL_STDC_HEADERS added
> to CFLAGS,
> but then subsequent packages failed to build for not knowing ptrdiff_t.
> So it
> appeared that I would have to have -DOPAL_STDC_HEADERS set for all
> dependent
> packaged.  So then I configured with CPPFLAGS set to the same as CFLAGS,
> and
> for 3.1.4, it configured, built, installed, and dependent packages did
> as well.
> For 4.0.1, it did not.
>
> Some responses below:
>
>  A few notes on your configure line:
> 
> >
> '/Users/cary/projects/ulixesall-llvm/builds/openmpi-4.0.1/nodl/../configure'
> \
> >
> --prefix=/Volumes/GordianStorage/opt/contrib-llvm7_appleclang/openmpi-4.0.1-nodl
> \
> >
> CC='/Volumes/GordianStorage/opt/clang+llvm-7.0.0-x86_64-apple-darwin/bin/clang'
> \
> >
> CXX='/Volumes/GordianStorage/opt/clang+llvm-7.0.0-x86_64-apple-darwin/bin/clang++'
> \
>  The MPI C++ bindings are no longer built by default, and you're not
> enabling them (via --enable-mpi-cxx), so you don't need to specify CXX or
> CXXFLAGS here.
>
> Thanks for this and your other comments on how to minimize our
> configuration
> line.  Will adopt.
>
>
> >FC='/opt/homebrew/bin/gfortran-6' \
> >F77='/opt/homebrew/bin/gfortran-6' \
>  F77 is ignored these days; FC is the only one that matters.
>
> Thanks.  Will clean up.
>
>
> >CFLAGS='-fvisibility=default -mmacosx-version-min=10.10
> -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include
> -fPIC -pipe -DSTDC_HEADERS' \
>  Do you need all these CFLAGS?  E.g., does clang not -I that directory
> by default (I don't actually know if it is necessary or not)?  What does
> -DSTDC_HEADERS do?
>
> -fvisibility=default: found needed for Boost, and kept through entire
> chain.  May revisit.
>
> -mmacosx-version-min=10.10: Gives a build that is guaranteed to work back
> through Yosemite, when
> we distribute our binaries.
>
> -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include:
> Apple is now putting nothing in /usr/include.  The sdk then provides its
> unique /usr/include above.  However, the LLVM latest (which gives openmp),
> does not know
> that, and so it has to be told which /usr/include to use.
>
> One can install command-line tools, but then one cannot have several sdk's
> co-exist.
>
> -fPIC: duplicative given --with-pic, but no harm, passed to all our
> packages.
>
> -pipe: supposedly build performance, but I have not tested.
>
> I *think* -DSTDC_HEADERS is a hint to Apple (but not to openmpi) to use
> only headers
> consistent with stdc.
>
>  Can you send all the information listed here:
> 
>   https://www.open-mpi.org/community/help/
>
> I have the following builds/build attempts:
>
> 3.3.4:
>
> 1. Not adding -DOPAL_STDC_HEADERS: build fails.
>
> 2. Adding -DOPAL_STDC_HEADERS: build succeeds, hdf5 parallel build fails.
>
> 3. Not adding -DOPAL_STDC_HEADERS, but setting CPPFLAGS to be identical
> to CFLAGS:
> openmpi and hdf5-parallel builds both succeed.
>
> 4.0.1, best formula above (CPPFLAGS set to same as CFLAGS):
>
> 4. Fails to build for a different reason.
>
> Which would you like?
>
> Best..John
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread George Bosilca
I was not able to reproduce the issue with openib on 4.0, but instead I get
random segfaults in MPI_Finalize during the grdma cleanup.

I could however reproduce the TCP timeout part with both 4.0 and master, on
a pretty sane cluster (only 3 interfaces: lo, eth0 and virbr0). Unsurprisingly,
the timeout was triggered by a busted TCP interface selection mechanism. As
soon as I exclude the virbr0 interface, everything goes back to normal.

  George.
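
[Editor's note: for reference, the interface selection can be constrained on
the mpirun command line; virbr0 here is just the virtual bridge from this
report.]

   mpirun --mca btl_tcp_if_exclude lo,virbr0 ./my_app   # exclude the virtual bridge
   mpirun --mca btl_tcp_if_include eth0 ./my_app        # or list only what should be used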

On Wed, Feb 20, 2019 at 5:20 PM Adam LeBlanc  wrote:

> Hello Howard,
>
> Thanks for all of the help and suggestions I will look into them. I also
> realized that my ansible wasn't setup properly for handling tar files so
> the nightly build didn't even install, but will do it by hand and will give
> you an update tomorrow somewhere in the afternoon.
>
> Thanks,
> Adam LeBlanc
>
> On Wed, Feb 20, 2019 at 4:26 PM Howard Pritchard 
> wrote:
>
>> Hello Adam,
>>
>> This helps some.  Could you post first 20 lines of you config.log.  This
>> will
>> help in trying to reproduce.  The content of your host file (you can use
>> generic
>> names for the nodes if that'a an issue to publicize) would also help as
>> the number of nodes and number of MPI processes/node impacts the way
>> the reduce scatter operation works.
>>
>> One thing to note about the openib BTL - it is on life support.   That's
>> why you needed to set btl_openib_allow_ib 1 on the mpirun command line.
>>
>> You may get much better success by installing UCX
>>  and rebuilding Open MPI to use
>> UCX.  You may actually already have UCX installed on your system if
>> a recent version of MOFED is installed.
>>
>> You can check this by running /usr/bin/ofed_rpm_info.  It will show which
>> ucx version has been installed.
>> If UCX is installed, you can add --with-ucx to the Open MPi configuration
>> line and it should build in UCX
>> support.   If Open MPI is built with UCX support, it will by default use
>> UCX for message transport rather than
>> the OpenIB BTL.
>>
>> thanks,
>>
>> Howard
>>
>>
>> Am Mi., 20. Feb. 2019 um 12:49 Uhr schrieb Adam LeBlanc <
>> alebl...@iol.unh.edu>:
>>
>>> On tcp side it doesn't seg fault anymore but will timeout on some tests
>>> but on the openib side it will still seg fault, here is the output:
>>>
>>> [pandora:19256] *** Process received signal ***
>>> [pandora:19256] Signal: Segmentation fault (11)
>>> [pandora:19256] Signal code: Address not mapped (1)
>>> [pandora:19256] Failing at address: 0x7f911c69fff0
>>> [pandora:19255] *** Process received signal ***
>>> [pandora:19255] Signal: Segmentation fault (11)
>>> [pandora:19255] Signal code: Address not mapped (1)
>>> [pandora:19255] Failing at address: 0x7ff09cd3fff0
>>> [pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680]
>>> [pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0]
>>> [pandora:19256] [ 2]
>>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55]
>>> [pandora:19256] [ 3]
>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b]
>>> [pandora:19256] [ 4] [pandora:19255] [ 0]
>>> /usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680]
>>> [pandora:19255] [ 1]
>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7]
>>> [pandora:19256] [ 5] IMB-MPI1[0x40b83b]
>>> [pandora:19256] [ 6] IMB-MPI1[0x407155]
>>> [pandora:19256] [ 7] IMB-MPI1[0x4022ea]
>>> [pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0]
>>> [pandora:19255] [ 2]
>>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5]
>>> [pandora:19256] [ 9] IMB-MPI1[0x401d49]
>>> [pandora:19256] *** End of error message ***
>>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55]
>>> [pandora:19255] [ 3]
>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b]
>>> [pandora:19255] [ 4]
>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7]
>>> [pandora:19255] [ 5] IMB-MPI1[0x40b83b]
>>> [pandora:19255] [ 6] IMB-MPI1[0x407155]
>>> [pandora:19255] [ 7] IMB-MPI1[0x4022ea]
>>> [pandora:19255] [ 8]
>>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5]
>>> [pandora:19255] [ 9] IMB-MPI1[0x401d49]
>>> [pandora:19255] *** End of error message ***
>>> [phoebe:12418] *** Process received signal ***
>>> [phoebe:12418] Signal: Segmentation fault (11)
>>> [phoebe:12418] Signal code: Address not mapped (1)
>>> [phoebe:12418] Failing at address: 0x7f5ce27dfff0
>>> [phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680]
>>> [phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0]
>>> [phoebe:12418] [ 2]
>>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55]
>>> [phoebe:12418] [ 3]
>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b]
>>> [phoebe:12418] [ 4]
>>> 

Re: [OMPI users] [Request for Cooperation] -- MPI International Survey

2019-02-20 Thread George Bosilca
George,

Thanks for letting us know about this issue; it was a misconfiguration of
the form. I guess we did not realize it, as most of us are automatically
signed in by our browsers. Anyway, thanks for the feedback; access to the
form should now be completely open.

Sorry for the inconvenience,
  George.



On Wed, Feb 20, 2019 at 2:27 PM George Reeke 
wrote:

> On Wed, 2019-02-20 at 13:21 -0500, George Bosilca wrote:
>
> > To obtain representative samples of the MPI community, we have
> > prepared a survey
> >
> >
> https://docs.google.com/forms/d/e/1FAIpQLSd1bDppVODc8nB0BjIXdqSCO_MuEuNAAbBixl4onTchwSQFwg/viewform
> >
> To access the survey, I was asked to create a google login.
> I do not wish to do this and cannot think of any obvious
> reason why this should be connected to the goals of the
> survey.  Can someone explain the purpose of this or
> hopefully change the survey so no login (to anyplace)
> is required.  I do program with open_mpi and would like
> to participate.
> George Reeke
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] [Request for Cooperation] -- MPI International Survey

2019-02-20 Thread George Bosilca
Dear colleagues,

As part of a wide-ranging effort to understand the current usage of the
Message Passing Interface (MPI) in the development of parallel applications
and to drive future additions to the MPI standard, an international team is
seeking feedback from the largest possible MPI audience (past, current, and
potential users on the globe) to obtain a better understanding of their
needs and to understand the impact of different MPI capabilities on the
development of distributed applications.

To obtain representative samples of the MPI community, we have prepared a
survey

https://docs.google.com/forms/d/e/1FAIpQLSd1bDppVODc8nB0BjIXdqSCO_MuEuNAAbBixl4onTchwSQFwg/viewform

that specifically targets all potential MPI users, including those in the
public, education, research and engineering domains--from undergraduate and
graduate students and postdocs to seasoned researchers and engineers.

The information gathered will be used to publish a comprehensive report of
the different use cases and potential areas of opportunity. These results
will be made freely available, and the raw data, the scripts to manipulate
it, as well as the resulting analysis will be, in time, posted on github [1],
while the curated results will be available via github pages [2].

For anyone interested in participating in the survey, we sincerely
appreciate your feedback.  The survey is rather short (about 30 easy
questions), and should not take more than 15 minutes to complete.  In
addition to your participation, we would appreciate if you re-distribute
this e-mail to your domestic/local communities.

Important Date -- This survey will be closed by the end of February 2019.

Questions? -- Please send any queries about this MPI survey to the sender
of this email.


Thank you on behalf of the International MPI Survey,
George Bosilca (UT/ICL)
Geoffroy Vallee (ORNL)
Emmanuel Jeannot (Inria)
Atsushi Hori (RIKEN)
Takahiro Ogura (RIKEN)

[1] https://github.com/bosilca/MPIsurvey/
[2] https://bosilca.github.io/MPIsurvey/
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Received values is different than sent after Isend() in MPI4py

2019-02-01 Thread George Bosilca
I think the temporary returned by np.ascontiguousarray will be reused by
Python before the data is really transferred by the Isend. The input buffer
for the Isend operation should remain constant for the entire duration of
the Isend+Wait window.

George

On Fri, Feb 1, 2019, 12:27 Konstantinos Konstantinidis wrote:

> Hi, consider a setup of one MPI sender process unicasting J messages to
> each of N MPI receiver processes, i.e., the total number of transmissions
> is J*N. Each transmission is a block of a matrix, which has been split both
> horizontally and vertically. Each block has been stored as an element of a
> 2D list, i.e. a list of J lists, each of which has N elements. I won't go
> into the details of the Python code (please see attached) but I am pasting
> the problematic part here:
>
> On the sender's level:
> *reqA = [None] * J * N*
>
> *for i in range(J):*
> * for j in range(N):*
> * reqA[i*N+j] = comm.Isend([np.ascontiguousarray(Ahv[i][j//k][j%k]),
> MPI.INT ], dest=j+1, tag=15)*
>
> *MPI.Request.Waitall(reqA)*
>
>
> On the receiver's level:
> *A = []*
> *rA = [None] * J*
> *for i in range(J):*
> * A.append(np.empty_like(np.matrix([[0]*(n//k) for j in range(m//q)])))*
> * rA[i] = comm.Irecv(A[i], source=0, tag=15)*
> *MPI.Request.Waitall(rA)*
>
> *#test*
> *for i in range(J):*
> * print "For job %d, worker %d received splits of A, b" % (i,
> comm.Get_rank()-1)*
> * print(A[i])*
>
> There are some other transmissions/requests in between which can be found
> in the attached file. I do not think they interfere with those, though.
> When I print all of the received blocks, they look like some memory trash
> and none of them is the same as the blocks that were sent. I am using
> Python 2.7.14 and Open MPI 2.1.2 (I have also attached the output of
> ompi_info).
>
> Thanks,
> Kostas Konstantinidis
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] MPI_Reduce_Scatter Segmentation Fault with Intel 2019 Update 1 Compilers on OPA-1

2018-12-04 Thread George Bosilca
I'm trying to replicate using the same compiler (icc 2019) on my OSX box over
TCP and shared memory, with no luck so far. So either the segfault is
something specific to OmniPath or to the memcpy implementation used on
Skylake. I tried to use the trace you sent, more specifically the
opal_datatype_copy_content_same_ddt mention, to understand where the
segfault happens, but unfortunately there are 3 calls to
opal_datatype_copy_content_same_ddt in the reduce_scatter algorithm. Can
you please build in debug mode and, if you can replicate the segfault, send
me the stack trace?

Thanks,
  George.


On Tue, Dec 4, 2018 at 5:07 AM Peter Kjellström  wrote:

> On Mon, 3 Dec 2018 19:41:25 +
> "Hammond, Simon David via users"  wrote:
>
> > Hi Open MPI Users,
> >
> > Just wanted to report a bug we have seen with OpenMPI 3.1.3 and 4.0.0
> > when using the Intel 2019 Update 1 compilers on our
> > Skylake/OmniPath-1 cluster. The bug occurs when running the Github
> > master src_c variant of the Intel MPI Benchmarks.
>
> I've noticed this also when using intel mpi (2018 and 2019u1). I
> classified it as a bug in imb but didn't look too deep (new
> reduce_scatter code).
>
> /Peter K
>
> --
> Sent from my Android device with K-9 Mail. Please excuse my
> brevity.___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] [version 2.1.5] invalid memory reference

2018-09-19 Thread George Bosilca
I can't speculate on why you did not notice the memory issue before, simply
because for months we (the developers) didn't notice it and our testing
infrastructure didn't catch this bug despite running millions of tests. The
root cause of the bug was a memory ordering issue, and these are really
tricky to identify.

According to https://github.com/open-mpi/ompi/issues/5638 the patch was
backported to all stable releases starting from 2.1. Until their official
release however you would either need to get a nightly snapshot or test
your luck with master.

  George.


On Wed, Sep 19, 2018 at 3:41 AM Patrick Begou <
patrick.be...@legi.grenoble-inp.fr> wrote:

> Hi George
>
> thanks for your answer. I was previously using OpenMPI 3.1.2 and have also
> this problem. However, using --enable-debug --enable-mem-debug at
> configuration time, I was unable to reproduce the failure and it was quite
> difficult for me do trace the problem. May be I have not run enought tests
> to reach the failure point.
>
> I fall back to  OpenMPI 2.1.5, thinking the problem was in the 3.x
> version. The problem was still there but with the debug config I was able
> to trace the call stack.
>
> Which OpenMPI 3.x version do you suggest ? A nightly snapshot ? Cloning
> the git repo ?
>
> Thanks
>
> Patrick
>
> George Bosilca wrote:
>
> Few days ago we have pushed a fix in master for a strikingly similar
> issue. The patch will eventually make it in the 4.0 and 3.1 but not on the
> 2.x series. The best path forward will be to migrate to a more recent OMPI
> version.
>
> George.
>
>
> On Tue, Sep 18, 2018 at 3:50 AM Patrick Begou <
> patrick.be...@legi.grenoble-inp.fr> wrote:
>
>> Hi
>>
>> I'm moving a large CFD code from Gcc 4.8.5/OpenMPI 1.7.3 to Gcc
>> 7.3.0/OpenMPI 2.1.5 and with this latest config I have random segfaults.
>> Same binary, same server, same number of processes (16), same parameters
>> for the run. Sometimes it runs until the end, sometime I get  'invalid
>> memory reference'.
>>
>> Building the application and OpenMPI in debug mode I saw that this random
>> segfault always occur in collective communications inside OpenMPI. I've no
>> idea howto track this. These are 2 call stack traces (just the openmpi
>> part):
>>
>> *Calling  MPI_ALLREDUCE(...)*
>>
>> Program received signal SIGSEGV: Segmentation fault - invalid memory
>> reference.
>>
>> Backtrace for this error:
>> #0  0x7f01937022ef in ???
>> #1  0x7f0192dd0331 in mca_btl_vader_check_fboxes
>> at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
>> #2  0x7f0192dd0331 in mca_btl_vader_component_progress
>> at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
>> #3  0x7f0192d6b92b in opal_progress
>> at ../../opal/runtime/opal_progress.c:226
>> #4  0x7f0194a8a9a4 in sync_wait_st
>> at ../../opal/threads/wait_sync.h:80
>> #5  0x7f0194a8a9a4 in ompi_request_default_wait_all
>> at ../../ompi/request/req_wait.c:221
>> #6  0x7f0194af1936 in ompi_coll_base_allreduce_intra_recursivedoubling
>> at ../../../../ompi/mca/coll/base/coll_base_allreduce.c:225
>> #7  0x7f0194aa0a0a in PMPI_Allreduce
>> at
>> /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pallreduce.c:107
>> #8  0x7f0194f2e2ba in ompi_allreduce_f
>> at
>> /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pallreduce_f.c:87
>> #9  0x8e21fd in __linear_solver_deflation_m_MOD_solve_el_grp_pcg
>> at linear_solver_deflation_m.f90:341
>>
>>
>> *Calling MPI_WAITALL()*
>>
>> Program received signal SIGSEGV: Segmentation fault - invalid memory
>> reference.
>>
>> Backtrace for this error:
>> #0  0x7fda5a8d72ef in ???
>> #1  0x7fda59fa5331 in mca_btl_vader_check_fboxes
>> at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
>> #2  0x7fda59fa5331 in mca_btl_vader_component_progress
>> at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
>> #3  0x7fda59f4092b in opal_progress
>> at ../../opal/runtime/opal_progress.c:226
>> #4  0x7fda5bc5f9a4 in sync_wait_st
>> at ../../opal/threads/wait_sync.h:80
>> #5  0x7fda5bc5f9a4 in ompi_request_default_wait_all
>> at ../../ompi/request/req_wait.c:221
>> #6  0x7fda5bca329e in PMPI_Waitall
>> at
>> /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pwaitall.c:76
>> #7  0x7fda5c10bc00 in ompi_waitall_f
>> at
>> /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pwaitall_f.c:104
>> #8  0x6dcbf7 in __data_comm_m_MOD_updat

Re: [OMPI users] [version 2.1.5] invalid memory reference

2018-09-18 Thread George Bosilca
A few days ago we pushed a fix in master for a strikingly similar issue.
The patch will eventually make it into the 4.0 and 3.1 series but not the
2.x series. The best path forward will be to migrate to a more recent OMPI
version.

George.


On Tue, Sep 18, 2018 at 3:50 AM Patrick Begou <
patrick.be...@legi.grenoble-inp.fr> wrote:

> Hi
>
> I'm moving a large CFD code from Gcc 4.8.5/OpenMPI 1.7.3 to Gcc
> 7.3.0/OpenMPI 2.1.5 and with this latest config I have random segfaults.
> Same binary, same server, same number of processes (16), same parameters
> for the run. Sometimes it runs until the end, sometime I get  'invalid
> memory reference'.
>
> Building the application and OpenMPI in debug mode I saw that this random
> segfault always occur in collective communications inside OpenMPI. I've no
> idea howto track this. These are 2 call stack traces (just the openmpi
> part):
>
> *Calling  MPI_ALLREDUCE(...)*
>
> Program received signal SIGSEGV: Segmentation fault - invalid memory
> reference.
>
> Backtrace for this error:
> #0  0x7f01937022ef in ???
> #1  0x7f0192dd0331 in mca_btl_vader_check_fboxes
> at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
> #2  0x7f0192dd0331 in mca_btl_vader_component_progress
> at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
> #3  0x7f0192d6b92b in opal_progress
> at ../../opal/runtime/opal_progress.c:226
> #4  0x7f0194a8a9a4 in sync_wait_st
> at ../../opal/threads/wait_sync.h:80
> #5  0x7f0194a8a9a4 in ompi_request_default_wait_all
> at ../../ompi/request/req_wait.c:221
> #6  0x7f0194af1936 in ompi_coll_base_allreduce_intra_recursivedoubling
> at ../../../../ompi/mca/coll/base/coll_base_allreduce.c:225
> #7  0x7f0194aa0a0a in PMPI_Allreduce
> at
> /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pallreduce.c:107
> #8  0x7f0194f2e2ba in ompi_allreduce_f
> at
> /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pallreduce_f.c:87
> #9  0x8e21fd in __linear_solver_deflation_m_MOD_solve_el_grp_pcg
> at linear_solver_deflation_m.f90:341
>
>
> *Calling MPI_WAITALL()*
>
> Program received signal SIGSEGV: Segmentation fault - invalid memory
> reference.
>
> Backtrace for this error:
> #0  0x7fda5a8d72ef in ???
> #1  0x7fda59fa5331 in mca_btl_vader_check_fboxes
> at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
> #2  0x7fda59fa5331 in mca_btl_vader_component_progress
> at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
> #3  0x7fda59f4092b in opal_progress
> at ../../opal/runtime/opal_progress.c:226
> #4  0x7fda5bc5f9a4 in sync_wait_st
> at ../../opal/threads/wait_sync.h:80
> #5  0x7fda5bc5f9a4 in ompi_request_default_wait_all
> at ../../ompi/request/req_wait.c:221
> #6  0x7fda5bca329e in PMPI_Waitall
> at
> /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pwaitall.c:76
> #7  0x7fda5c10bc00 in ompi_waitall_f
> at
> /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pwaitall_f.c:104
> #8  0x6dcbf7 in __data_comm_m_MOD_update_ghost_ext_comm_r1
> at data_comm_m.f90:5849
>
>
> The segfault is alway located in opal/mca/btl/vader/btl_vader_fbox.h at
> 207/* call the registered callback function */
> 208   reg->cbfunc(_btl_vader.super, hdr.data.tag, ,
> reg->cbdata);
>
>
> OpenMPI 2.1.5 is build with:
> CFLAGS="-O3 -march=native -mtune=native" CXXFLAGS="-O3 -march=native
> -mtune=native" FCFLAGS="-O3 -march=native -mtune=native" \
> ../configure --prefix=$DESTMPI  --enable-mpirun-prefix-by-default
> --disable-dlopen \
> --enable-mca-no-build=openib --without-verbs --enable-mpi-cxx
> --without-slurm --enable-mpi-thread-multiple  --enable-debug
> --enable-mem-debug
>
> Any help appreciated
>
> Patrick
>
> --
> ===
> |  Equipe M.O.S.T. |  |
> |  Patrick BEGOU   | mailto:patrick.be...@grenoble-inp.fr 
>  |
> |  LEGI|  |
> |  BP 53 X | Tel 04 76 82 51 35   |
> |  38041 GRENOBLE CEDEX| Fax 04 76 82 52 71   |
> ===
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] know which CPU has the maximum value

2018-08-10 Thread George Bosilca
You will need to create a special variable that holds 2 entries: the value
to be compared by the max operation (with whatever type you need) and an int
holding the rank of the process that owns it. MPI_MAXLOC is described on the
OMPI man page [1], and you can find an example of how to use it on the MPI
Forum pages [2].

George.


[1] https://www.open-mpi.org/doc/v2.0/man3/MPI_Reduce.3.php
[2] https://www.mpi-forum.org/docs/mpi-1.1/mpi-11-html/node79.html
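
[Editor's note: for example, in C the predefined pair type MPI_DOUBLE_INT
matches a double value plus an int rank (a minimal sketch; the Fortran
equivalent would use MPI_2DOUBLE_PRECISION).]

   struct { double value; int rank; } in, out;

   in.value = eff;        /* the local value */
   in.rank  = my_rank;    /* who owns it     */

   MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, MPI_COMM_WORLD);

   /* out.value is the global maximum and out.rank the rank that owns it; when
      several ranks hold the same maximum, the smallest such rank is returned. */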

On Fri, Aug 10, 2018 at 11:25 AM Diego Avesani 
wrote:

>  Dear all,
> I have probably understood.
> The trick is to use a real vector and to memorize also the rank.
>
> Have I understood correctly?
> thanks
>
> Diego
>
>
> On 10 August 2018 at 17:19, Diego Avesani  wrote:
>
>> Deal all,
>> I do not understand how MPI_MINLOC works. it seem locate the maximum in a
>> vector and not the CPU to which the valur belongs to.
>>
>> @ray: and if two has the same value?
>>
>> thanks
>>
>>
>> Diego
>>
>>
>> On 10 August 2018 at 17:03, Ray Sheppard  wrote:
>>
>>> As a dumb scientist, I would just bcast the value I get back to the
>>> group and ask whoever owns it to kindly reply back with its rank.
>>>  Ray
>>>
>>>
>>> On 8/10/2018 10:49 AM, Reuti wrote:
>>>
 Hi,

 Am 10.08.2018 um 16:39 schrieb Diego Avesani :
>
> Dear all,
>
> I have a problem:
> In my parallel program each CPU compute a value, let's say eff.
>
> First of all, I would like to know the maximum value. This for me is
> quite simple,
> I apply the following:
>
> CALL MPI_ALLREDUCE(eff, effmaxWorld, 1, MPI_DOUBLE_PRECISION, MPI_MAX,
> MPI_MASTER_COMM, MPIworld%iErr)
>
 Would MPI_MAXLOC be sufficient?

 -- Reuti


 However, I would like also to know to which CPU that value belongs. Is
> it possible?
>
> I have set-up a strange procedure but it works only when all the CPUs
> has different values but fails when two of then has the same eff value.
>
> Is there any intrinsic MPI procedure?
> in anternative,
> do you have some idea?
>
> really, really thanks.
> Diego
>
>
> Diego
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
 ___
 users mailing list
 users@lists.open-mpi.org
 https://lists.open-mpi.org/mailman/listinfo/users

>>>
>>> ___
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/users
>>>
>>
>>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] MPI_Comm_get_attr fails for sub-communicators created by MPI_Comm_split

2018-07-08 Thread George Bosilca
Yes, this is the behavior defined by the MPI standard. More precisely,
section 8.1.2 of the MPI 3.1 standard clearly states that the predefined
attributes only exist for MPI_COMM_WORLD.

  George.
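
[Editor's note: a portable workaround is therefore to query the attribute once
on MPI_COMM_WORLD and reuse the value for communicators derived from it, e.g.
(a sketch in C):]

   void *p;
   int flag;
   int tag_ub = 32767;   /* the minimum upper bound the MPI standard guarantees */

   MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &p, &flag);
   if (flag) tag_ub = *(int *)p;
   /* MPI_TAG_UB is a property of the implementation, so the same limit applies
      to communicators later created with MPI_Comm_split & co. */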



On Sun, Jul 8, 2018 at 1:55 AM Weiqun Zhang  wrote:

> Hi,
>
> It appears that MPI_Comm_get_attr fails to get MPI_TAG_UB for
> sub-communicators created by MPI_Comm_split.  I have tested the following
> programs with openmpi 3.1.1 and openmpi 1.10.2.  MPI_Comm_get_attr works on
> MPI_COMM_WORLD, but not on sub-communicators.  Is this expected?
>
> #include 
> #include 
>
> int main(int argc, char** argv)
> {
> MPI_Init(, );
>
> int rank;
> MPI_Comm_rank(MPI_COMM_WORLD, );
> void* p; int flag;
> MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, , );
> if (!flag) {
> printf("MPI_COMM_WORLD: rank %d failed to get MPI_TAG_UB\n", rank);
> } else {
> printf("MPI_COMM_WORLD: rank %d succeeded to get MPI_TAG_UB\n",
> rank);
> }
>
> int color = rank%2;
> MPI_Comm comm;
> MPI_Comm_split(MPI_COMM_WORLD, color, rank, );
> int rank2;
> MPI_Comm_rank(comm, );
> MPI_Comm_get_attr(comm, MPI_TAG_UB, , );
> if (!flag) {
> printf("Subcommunicator %d rank %d failed to get MPI_TAG_UB\n",
> color, rank2);
> } else {
> printf("Subcommunicator %d rank %d succeeded to get MPI_TAG_UB\n",
> color, rank2);
> }
>
> MPI_Finalize();
> }
>
> program main
>
>   use mpi
>   implicit none
>
>   integer :: ierr, rank, color, comm, rank2
>   integer(kind=MPI_ADDRESS_KIND) :: attrval
>   logical :: flag
>
>   call MPI_Init(ierr)
>
>   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
>   call MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, attrval, flag, ierr)
>   if (.not.flag) then
>  print *, "MPI_COMM_WORLD: rank ", rank, " failed to get MPI_TAG_UB."
>   else
>  print *, "MPI_COMM_WORLD: rank ", rank, " succeeded to get
> MPI_TAG_UB."
>   end if
>
>   color = modulo(rank,2)
>   call MPI_Comm_split(MPI_COMM_WORLD, color, rank, comm, ierr)
>   call MPI_Comm_rank(comm, rank2, ierr)
>   call MPI_Comm_get_attr(comm, MPI_TAG_UB, attrval, flag, ierr)
>   if (.not.flag) then
>  print *, "Subcommunicator ", color, " rank ", rank2, " failed to get
> MPI_TAG_UB."
>   else
>  print *, "Subcommunicator ", color, " rank ", rank2, " succeeded to
> get MPI_TAG_UB."
>   end if
>
>   call MPI_Finalize(ierr)
>
> end program main
>
> Thanks,
>
> Weiqun
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] MPI Windows: performance of local memory access

2018-05-23 Thread George Bosilca
We had a similar issue a few months back. After investigation it turned out
to be related to NUMA balancing [1] being enabled by default on recent
releases of Linux-based OSes.

In our case turning off NUMA balancing fixed most of the performance
inconsistencies we had. You can check its status in
/proc/sys/kernel/numa_balancing.

  George.
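
[Editor's note: for reference, on a typical Linux system the setting can be
inspected and temporarily disabled as follows; whether disabling it is
appropriate depends on the other workloads sharing the node.]

   cat /proc/sys/kernel/numa_balancing        # 1 = enabled, 0 = disabled
   sudo sysctl -w kernel.numa_balancing=0     # disable until the next reboot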




On Wed, May 23, 2018 at 11:16 AM, Jeff Hammond 
wrote:

> This is very interesting.  Thanks for providing a test code.  I have two
> suggestions for understanding this better.
>
> 1) Use MPI_Win_allocate_shared instead and measure the difference with and
> without alloc_shared_noncontig.  I think this info is not available for
> MPI_Win_allocate because MPI_Win_shared_query is not permitted on
> MPI_Win_allocate windows.  This is a flaw in MPI-3 that I would like to see
> fixed (https://github.com/mpi-forum/mpi-issues/issues/23).
>
> 2) Extend your test to allocate with mmap and measure with various sets of
> map flags (http://man7.org/linux/man-pages/man2/mmap.2.html).  Starting
> with MAP_SHARED and MAP_PRIVATE is the right place to start.  This
> experiment should make the cause unambiguous.
>
> Most likely, this is due to shared vs private mapping, but there is likely
> a tradeoff w.r.t. RMA performance.  It depends on your network and how the
> MPI implementation uses it, but MPI_Win_create_dynamic likely leads to much
> worse RMA performance than MPI_Win_allocate.  MPI_Win_create with a
> malloc'd buffer may perform worse than MPI_Win_allocate for internode RMA
> if the MPI implementation is lazy and doesn't cache page registration in
> MPI_Win_create.
>
> Jeff
>
> On Wed, May 23, 2018 at 3:45 AM, Joseph Schuchart 
> wrote:
>
>> All,
>>
>> We are observing some strange/interesting performance issues in accessing
>> memory that has been allocated through MPI_Win_allocate. I am attaching our
>> test case, which allocates memory for 100M integer values on each process
>> both through malloc and MPI_Win_allocate and writes to the local ranges
>> sequentially.
>>
>> On different systems (incl. SuperMUC and a Bull Cluster), we see that
>> accessing the memory allocated through MPI is significantly slower than
>> accessing the malloc'ed memory if multiple processes run on a single node,
>> increasing the effect with increasing number of processes per node. As an
>> example, running 24 processes per node with the example attached we see the
>> operations on the malloc'ed memory to take ~0.4s while the MPI allocated
>> memory takes up to 10s.
>>
>> After some experiments, I think there are two factors involved:
>>
>> 1) Initialization: it appears that the first iteration is significantly
>> slower than any subsequent accesses (1.1s vs 0.4s with 12 processes on a
>> single socket). Excluding the first iteration from the timing or memsetting
>> the range leads to comparable performance. I assume that this is due to
>> page faults that stem from first accessing the mmap'ed memory that backs
>> the shared memory used in the window. The effect of presetting the
>> malloc'ed memory seems smaller (0.4s vs 0.6s).
>>
>> 2) NUMA effects: Given proper initialization, running on two sockets
>> still leads to fluctuating performance degradation under the MPI window
>> memory, which ranges up to 20x (in extreme cases). The performance of
>> accessing the malloc'ed memory is rather stable. The difference seems to
>> get smaller (but does not disappear) with increasing number of repetitions.
>> I am not sure what causes these effects as each process should first-touch
>> their local memory.
>>
>> Are these known issues? Does anyone have any thoughts on my analysis?
>>
>> It is problematic for us that replacing local memory allocation with MPI
>> memory allocation leads to performance degradation as we rely on this
>> mechanism in our distributed data structures. While we can ensure proper
>> initialization of the memory to mitigate 1) for performance measurements, I
>> don't see a way to control the NUMA effects. If there is one I'd be happy
>> about any hints :)
>>
>> I should note that we also tested MPICH-based implementations, which
>> showed similar effects (as they also mmap their window memory). Not
>> surprisingly, using MPI_Alloc_mem and attaching that memory to a dynamic
>> window does not cause these effects while using shared memory windows does.
>> I ran my experiments using Open MPI 3.1.0 with the following command lines:
>>
>> - 12 cores / 1 socket:
>> mpirun -n 12 --bind-to socket --map-by ppr:12:socket
>> - 24 cores / 2 sockets:
>> mpirun -n 24 --bind-to socket
>>
>> and verified the binding using  --report-bindings.
>>
>> Any help or comment would be much appreciated.
>>
>> Cheers
>> Joseph
>>
>> --
>> Dipl.-Inf. Joseph Schuchart
>> High Performance Computing Center Stuttgart (HLRS)
>> Nobelstr. 19
>> D-70569 Stuttgart
>>
>> Tel.: +49(0)711-68565890
>> Fax: +49(0)711-6856832
>> E-Mail: schuch...@hlrs.de
>>
>> 

Re: [OMPI users] peformance abnormality with openib and tcp framework

2018-05-14 Thread George Bosilca
Shared memory communication is important for multi-core platforms,
especially when you have multiple processes per node. But this is only part
of your issue here.

You haven't specified how your processes will be mapped on your resources.
As a result, ranks 0 and 1 will be on the same node, so you are testing the
shared memory support of whatever BTL you allow. In this case the
performance will be much better for TCP than for IB, simply because you are
not exercising the network but rather the node's capacity to move data across
memory banks. In such an environment, TCP translates to a memcpy plus a
system call, which is much faster than IB. That being said, it should not
matter, because shared memory is there to cover this case.

Add "--map-by node" to your mpirun command to measure the bandwidth between
nodes.

  George.
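
[Editor's note: for example, for a two-rank point-to-point benchmark (the
benchmark name is only an example; keep whatever --mca btl selection you are
comparing):]

   mpirun -np 2 --map-by node --report-bindings ./osu_bw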



On Mon, May 14, 2018 at 5:04 AM, Blade Shieh  wrote:

>
> Hi, Nathan:
> Thanks for you reply.
> 1) It was my mistake not to notice usage of osu_latency. Now it worked
> well, but still poorer in openib.
> 2) I did not use sm or vader because I wanted to check performance between
> tcp and openib. Besides, I will run the application in cluster, so vader is
> not so important.
> 3) Of course, I tried you suggestions. I used ^tcp/^openib and set
> btl_openib_if_include to mlx5_0 in a two-node cluster (IB
> direcet-connected).  The result did not change -- IB still better in MPI
> benchmark but poorer in my applicaion.
>
> Best Regards,
> Xie Bin
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] mpi send/recv pair hangin

2018-04-09 Thread George Bosilca
Noam,

I have a few questions for you. According to your original email you are
using OMPI 3.0.1 (but the hang can also be reproduced with 3.0.0). Also,
according to your stack trace, I assume it is an x86_64 system compiled with
icc.

Is your application multithreaded? How did you initialize MPI (which level
of threading)? Can you send us the opal_config.h file, please?

Thanks,
  George.




On Sun, Apr 8, 2018 at 8:30 PM, George Bosilca <bosi...@icl.utk.edu> wrote:

> Right, it has nothing to do with the tag. The sequence number is an
> internal counter that help OMPI to deliver the messages in the MPI required
> order (FIFO ordering per communicator per peer).
>
> Thanks for offering your help to debug this issue. We'll need to figure
> out how this can happen, and we will get back to you for further debugging.
>
>   George.
>
>
>
> On Sun, Apr 8, 2018 at 6:00 PM, Noam Bernstein <
> noam.bernst...@nrl.navy.mil> wrote:
>
>> On Apr 8, 2018, at 3:58 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>> Noam,
>>
>> Thanks for your output, it highlight an usual outcome. It shows that a
>> process (29662) has pending messages from other processes that are
>> tagged with a past sequence number, something that should have not
>> happened. The only way to get that is if somehow we screwed-up the sending
>> part and push the same sequence number twice ...
>>
>> More digging is required.
>>
>>
>> OK - these sequence numbers are unrelated to the send/recv tags, right?
>> I’m happy to do any further debugging.  I can’t share code, since we do
>> have access but it’s not open source, but I’d be happy to test out anything
>> you can suggest.
>>
>> thanks,
>> Noam
>>
>> 
>> |
>> |
>> |
>> *U.S. NAVAL*
>> |
>> |
>> _*RESEARCH*_
>> |
>> LABORATORY
>>
>> Noam Bernstein, Ph.D.
>> Center for Materials Physics and Technology
>> U.S. Naval Research Laboratory
>> T +1 202 404 8628  F +1 202 404 7546
>> https://www.nrl.navy.mil
>>
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>>
>
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] mpi send/recv pair hangin

2018-04-08 Thread George Bosilca
Right, it has nothing to do with the tag. The sequence number is an
internal counter that helps OMPI deliver the messages in the MPI-required
order (FIFO ordering per communicator per peer).

Thanks for offering your help to debug this issue. We'll need to figure out
how this can happen, and we will get back to you for further debugging.

  George.



On Sun, Apr 8, 2018 at 6:00 PM, Noam Bernstein <noam.bernst...@nrl.navy.mil>
wrote:

> On Apr 8, 2018, at 3:58 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> Noam,
>
> Thanks for your output, it highlight an usual outcome. It shows that a
> process (29662) has pending messages from other processes that are tagged
> with a past sequence number, something that should have not happened. The
> only way to get that is if somehow we screwed-up the sending part and push
> the same sequence number twice ...
>
> More digging is required.
>
>
> OK - these sequence numbers are unrelated to the send/recv tags, right?
> I’m happy to do any further debugging.  I can’t share code, since we do
> have access but it’s not open source, but I’d be happy to test out anything
> you can suggest.
>
> thanks,
> Noam
>
> 
> |
> |
> |
> *U.S. NAVAL*
> |
> |
> _*RESEARCH*_
> |
> LABORATORY
>
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628  F +1 202 404 7546
> https://www.nrl.navy.mil
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] mpi send/recv pair hangin

2018-04-08 Thread George Bosilca
Noam,

Thanks for your output, it highlights an unusual outcome. It shows that a
process (29662) has pending messages from other processes that are tagged
with a past sequence number, something that should not have happened. The
only way to get that is if somehow we screwed up the sending part and pushed
the same sequence number twice ...

More digging is required.

  George.



On Fri, Apr 6, 2018 at 2:42 PM, Noam Bernstein <noam.bernst...@nrl.navy.mil>
wrote:

>
> On Apr 6, 2018, at 1:41 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> Noam,
>
> According to your stack trace the correct way to call the mca_pml_ob1_dump
> is with the communicator from the PMPI call. Thus, this call was successful:
>
> (gdb) call mca_pml_ob1_dump(0xed932d0, 1)
> $1 = 0
>
>
> I should have been more clear, the output is not on gdb but on the output
> stream of your application. If you run your application by hand with
> mpirun, the output should be on the terminal where you started mpirun. If
> you start your job with a batch schedule, the output should be in the
> output file associated with your job.
>
>
> OK, that makes sense.  Here’s what I get from the two relevant processes.
>  compute-1-9 should be receiving, and 1-10 sending, I believe.  Is it
> possible that the fact that all send send/recv pairs (nodes 1-3 on each set
> of 4 sending to 0, which is receiving from each one in turn) are using the
> same tag (200) is confusing things?
>
> [compute-1-9:29662] Communicator MPI COMMUNICATOR 5 SPLIT FROM 3
> [0xeba14d0](5) rank 0 recv_seq 8855 num_procs 4 last_probed 0
> [compute-1-9:29662] [Rank 1] expected_seq 175 ompi_proc 0xeb0ec50 send_seq
> 8941
> [compute-1-9:29662] [Rank 2] expected_seq 127 ompi_proc 0xeb97200 send_seq
> 385
> [compute-1-9:29662] unexpected frag
> [compute-1-9:29662] hdr RNDV [   ] ctx 5 src 2 tag 200 seq 126
> msg_length 86777600
> [compute-1-9:29662] [Rank 3] expected_seq 8558 ompi_proc 0x2b8ee8000f90
> send_seq 5
> [compute-1-9:29662] unexpected frag
> [compute-1-9:29662] hdr RNDV [   ] ctx 5 src 3 tag 200 seq 8557
> msg_length 86777600
>
> [compute-1-10:15673] Communicator MPI COMMUNICATOR 5 SPLIT FROM 3
> [0xe9cc6a0](5) rank 1 recv_seq 9119 num_procs 4 last_probed 0
> [compute-1-10:15673] [Rank 0] expected_seq 8942 ompi_proc 0xe8e1db0
> send_seq 174
> [compute-1-10:15673] [Rank 2] expected_seq 54 ompi_proc 0xe9d7940 send_seq
> 8561
> [compute-1-10:15673] [Rank 3] expected_seq 126 ompi_proc 0xe9c20c0
> send_seq 385
>
>
> 
> |
> |
> |
> *U.S. NAVAL*
> |
> |
> _*RESEARCH*_
> |
> LABORATORY
>
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628  F +1 202 404 7546
> https://www.nrl.navy.mil
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] mpi send/recv pair hangin

2018-04-06 Thread George Bosilca
Noam,

According to your stack trace the correct way to call the mca_pml_ob1_dump
is with the communicator from the PMPI call. Thus, this call was successful:

(gdb) call mca_pml_ob1_dump(0xed932d0, 1)
$1 = 0


I should have been clearer: the output does not appear in gdb but on the
output stream of your application. If you run your application by hand with
mpirun, the output should be on the terminal where you started mpirun. If
you start your job with a batch scheduler, the output should be in the
output file associated with your job.

  George.



On Fri, Apr 6, 2018 at 12:53 PM, Noam Bernstein <noam.bernst...@nrl.navy.mil
> wrote:

> On Apr 5, 2018, at 4:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> I attach with gdb on the processes and do a "call mca_pml_ob1_dump(comm,
> 1)". This allows the debugger to make a call our function, and output
> internal information about the library status.
>
>
> OK - after a number of missteps, I recompiled openmpi with debugging mode
> active, reran the executable (didn’t recompile, just using the new
> library), and got the comm pointer by attaching to the process and looking
> at the stack trace:
>
> #0  0x2b8a7599c42b in ibv_poll_cq (cq=0xec66010, num_entries=256,
> wc=0x7ffdea76d680) at /usr/include/infiniband/verbs.h:1272
> #1  0x2b8a759a8194 in poll_device (device=0xebc5300, count=0) at
> btl_openib_component.c:3608
> #2  0x2b8a759a871f in progress_one_device (device=0xebc5300) at
> btl_openib_component.c:3741
> #3  0x2b8a759a87be in btl_openib_component_progress () at
> btl_openib_component.c:3765
> #4  0x2b8a64b9da42 in opal_progress () at runtime/opal_progress.c:222
> #5  0x2b8a76c2c199 in ompi_request_wait_completion (req=0xec22600) at
> ../../../../ompi/request/request.h:392
> #6  0x2b8a76c2d642 in mca_pml_ob1_recv (addr=0x2b8a8a99bf20,
> count=5423600, datatype=0x2b8a64832b80, src=1, tag=200, comm=0xed932d0,
> status=0x385dd90) at pml_ob1_irecv.c:135
> #7  0x2b8a6454c857 in PMPI_Recv (buf=0x2b8a8a99bf20, count=5423600,
> type=0x2b8a64832b80, source=1, tag=200, comm=0xed932d0, status=0x385dd90)
> at precv.c:79
> #8  0x2b8a6428ca7c in ompi_recv_f (buf=0x2b8a8a99bf20
> "DB»\373\v{\277\204\333\336\306[B\205\277\030ҀҶ\250v\277\
> 225\377qW\001\251w?\240\020\202&=)S\277\202+\214\067\224\345R?\272\221Co\236\206\217?",
> count=0x7ffdea770eb4, datatype=0x2d43bec, source=0x7ffdea770a38,
> tag=0x2d43bf0, comm=0x5d30a68, status=0x385dd90, ierr=0x7ffdea770a3c)
> at precv_f.c:85
> #9  0x0042887b in m_recv_z (comm=..., node=-858993460, zvec=Cannot
> access memory at address 0x2d
> ) at mpi.F:680
> #10 0x0123e0f1 in fileio::outwav (io=..., wdes=..., w=Cannot
> access memory at address 0x2d
> ) at fileio.F:952
> #11 0x02abfd8f in vamp () at main.F:4204
> #12 0x004139de in main ()
> #13 0x003f0c81ed1d in __libc_start_main () from /lib64/libc.so.6
> #14 0x004138e9 in _start ()
>
>
> The comm value is different in ompi_recv_f and things below, so I tried
> both.   With the value of the lower level functions I get nothing useful
>
> (gdb) call mca_pml_ob1_dump(0xed932d0, 1)
> $1 = 0
>
> and with the value from ompi_recv_f I get a seg fault:
>
> (gdb) call mca_pml_ob1_dump(0x5d30a68, 1)
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x2b8a76c26d0d in mca_pml_ob1_dump (comm=0x5d30a68, verbose=1) at
> pml_ob1.c:577
> 577opal_output(0, "Communicator %s [%p](%d) rank %d recv_seq %d
> num_procs %lu last_probed %lu\n",
> The program being debugged was signaled while in a function called from
> GDB.
> GDB remains in the frame where the signal was received.
> To change this behavior use "set unwindonsignal on".
> Evaluation of the expression containing the function
> (mca_pml_ob1_dump) will be abandoned.
> When the function is done executing, GDB will silently stop.
>
> Should this have worked, or am I doing something wrong?
>
> thanks,
> Noam
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread George Bosilca
Yes, you can do this by adding --enable-debug to OMPI configure (and make
sure you don't have the configure flag --with-platform=optimize).
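
For example, something along these lines (the install prefix is just a
placeholder):

  ./configure --prefix=$HOME/openmpi-debug --enable-debug
  make all install

and then make sure PATH and LD_LIBRARY_PATH point at the debug installation
before rerunning the application.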

  George.


On Thu, Apr 5, 2018 at 4:20 PM, Noam Bernstein <noam.bernst...@nrl.navy.mil>
wrote:

>
> On Apr 5, 2018, at 4:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> I attach with gdb on the processes and do a "call mca_pml_ob1_dump(comm,
> 1)". This allows the debugger to make a call our function, and output
> internal information about the library status.
>
>
> Great.  But I guess I need to recompile ompi in debug mode?  Is that just
> a flag to configure?
>
> thanks,
> Noam
>
>
> 
> U.S. NAVAL RESEARCH LABORATORY
>
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628  F +1 202 404 7546
> https://www.nrl.navy.mil
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread George Bosilca
I attach with gdb on the processes and do a "call mca_pml_ob1_dump(comm,
1)". This allows the debugger to make a call to our function and output
internal information about the library status.

  George.



On Thu, Apr 5, 2018 at 4:03 PM, Noam Bernstein <noam.bernst...@nrl.navy.mil>
wrote:

> On Apr 5, 2018, at 3:55 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> Noam,
>
> The OB1 provide a mechanism to dump all pending communications in a
> particular communicator. To do this I usually call mca_pml_ob1_dump(comm,
> 1), with comm being the MPI_Comm and 1 being the verbose mode. I have no
> idea how you can find the pointer to the communicator out of your code, but
> if you compile OMPI in debug mode you will see it as an argument to the 
> mca_pml_ob1_send
> and mca_pml_ob1_recv function.
>
> This information will give us a better idea on what happened to the
> message, where is has been sent (or not), and what were the source and tag
> used for the matching.
>
>
> Interesting.  How would you do this in a hung program?  Call it before you
> call the things that you expect will hang?  And any ideas how to get a
> communicator pointer from fortran?
>
> Noam
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread George Bosilca
Noam,

The OB1 PML provides a mechanism to dump all pending communications in a
particular communicator. To do this I usually call mca_pml_ob1_dump(comm,
1), with comm being the MPI_Comm and 1 being the verbose mode. I have no
idea how you can find the pointer to the communicator from your code, but
if you compile OMPI in debug mode you will see it as an argument to the
mca_pml_ob1_send and mca_pml_ob1_recv functions.

This information will give us a better idea of what happened to the
message, whether it has been sent (or not), and what source and tag were
used for the matching.

  George.



On Thu, Apr 5, 2018 at 12:01 PM, Edgar Gabriel 
wrote:

> is the file I/O that you mentioned using MPI I/O for that? If yes, what
> file system are you writing to?
>
> Edgar
>
>
>
> On 4/5/2018 10:15 AM, Noam Bernstein wrote:
>
>> On Apr 5, 2018, at 11:03 AM, Reuti  wrote:
>>>
>>> Hi,
>>>
>>> Am 05.04.2018 um 16:16 schrieb Noam Bernstein <
 noam.bernst...@nrl.navy.mil>:

 Hi all - I have a code that uses MPI (vasp), and it’s hanging in a
 strange way.  Basically, there’s a Cartesian communicator, 4x16 (64
 processes total), and despite the fact that the communication pattern is
 rather regular, one particular send/recv pair hangs consistently.
 Basically, across each row of 4, task 0 receives from 1,2,3, and tasks
 1,2,3 send to 0.  On most of the 16 such sets all those send/recv pairs
 complete.  However, on 2 of them, it hangs (both the send and recv).  I
 have stack traces (with gdb -p on the running processes) from what I
 believe are corresponding send/recv pairs.

 

 This is with OpenMPI 3.0.1 (same for 3.0.0, haven’t checked older
 versions), Intel compilers (17.2.174). It seems to be independent of which
 nodes, always happens on this pair of calls and happens after the code has
 been running for a while, and the same code for the other 14 sets of 4 work
 fine, suggesting that it’s an MPI issue, rather than an obvious bug in this
 code or a hardware problem.  Does anyone have any ideas, either about
 possible causes or how to debug things further?

>>> Do you use scaLAPACK, and which type of BLAS/LAPACK? I used Intel MKL
>>> with the Intel compilers for VASP and found, that using in addition a
>>> self-compiled scaLAPACK is working fine in combination with Open MPI. Using
>>> Intel scaLAPACK and Intel MPI is also working fine. What I never got
>>> working was the combination Intel scaLAPACK and Open MPI – at one point one
>>> process got a message from a wrong rank IIRC. I tried both: the Intel
>>> supplied Open MPI version of scaLAPACK and also compiling the necessary
>>> interface on my own for Open MPI in $MKLROOT/interfaces/mklmpi with
>>> identical results.
>>>
>> MKL BLAS/LAPACK, with my own self-compiled scalapack, but in this run I
>> set LSCALAPCK=.FALSE. I suppose I could try compiling without it just to
>> test.  In any case, this is when it’s writing out the wavefunctions, which
>> I would assume be unrelated to scalapack operations (unless they’re
>> corrupting some low level MPI thing, I guess).
>>
>>
>>   Noam
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] running mpi program between my PC and an ARM-architektur raspberry

2018-04-04 Thread George Bosilca
We can always build complicated solutions, but in some cases sane and
simple solutions exist. Let me clear up some of the misinformation in this
thread.

The MPI standard is clear about what type of conversion is allowed and how
it should be done (for more info read Chapter 4): no type conversion is
allowed (don't send a long and expect a short); for everything else,
truncation to a sane value is the rule. This is nothing new, the rules are
similar to other data conversion standards such as XDR. Thus, if you send
an MPI_LONG from a machine where long is 8 bytes to an MPI_LONG on a
machine where it is 4 bytes, you will get a valid number when possible,
otherwise [MAX|MIN]_LONG on the target machine. For floating point data the
rules are more complicated due to potential exponent and mantissa length
mismatch, but in general if the data is representable on the target
architecture a sane value is obtained. Otherwise, the data will be replaced
with one of the extremes. This also applies to file operations for as long
as the correct external32 type is used.

The datatype engine in Open MPI supports all these conversions, as long
as the source and target machines are correctly identified. This
identification is only enabled when OMPI is compiled with support for
heterogeneous architectures.
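
As a minimal sketch of the long-to-long case described above (assuming a
heterogeneous build; the rank variables and peer numbers are placeholders):

  /* sender, where long is 8 bytes */
  long v = 1234567890123L;
  MPI_Send(&v, 1, MPI_LONG, 1, 0, MPI_COMM_WORLD);

  /* receiver, where long is 4 bytes: the matching MPI_LONG must be used; the
     value is converted when representable, otherwise clamped to the target's
     LONG_MAX/LONG_MIN */
  long w;
  MPI_Recv(&w, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);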

  George.


On Wed, Apr 4, 2018 at 11:35 AM, George Reeke 
wrote:

> Dear colleagues,
>FWIW, years ago I was looking at this problem and developed my
> own solution (for C programs) with this structure:
> --Be sure your code that works with ambiguous-length types like
> 'long' can handle different sizes.  I have replacement unambiguous
> typedef names like 'si32', 'ui64' etc. for the usual signed and
> unsigned fixed-point numbers.
> --Run your source code through a utility that analyzes a specified
> set of variables, structures, and unions that will be used in
> messages and builds tables giving their included types.  Include
> these tables in your makefiles.
> --Replace malloc, calloc, realloc, free with my own versions,
> where you pass a type argument pointing into to this table along
> with number of items, etc.  There are separate memory pools for
> items that will be passed often, rarely, or never, just to make
> things more efficient.
> --Do all these calls on the rank 0 processor at program startup and
> call a special broadcast routine that sets up data structures on
> all the other processors to manage the conversions.
> --Replace mpi message passing and broadcast calls with new routines
> that use the type information (stored by malloc, calloc, etc.) to
> determine what variables to lengthen or shorten or swap on arrival
> at the destination.  Regular mpi message passing is used inside
> these routines and can be used natively for variables that do not
> ever need length changes or byte swapping (i.e. text).  I have a
> simple set of routines to gather statistics across nodes with sum,
> max, etc. operations, but not too fancy.  I do not have versions of
> any of the mpi operations that collect or distribute matrices, etc.
> --A little routine must be written for every union.  This is called
> from the package when a union is received to determine which
> member is present so the right conversion can be done.
> --There was a hook to handle IBM (hex exponent) vs IEEE floating
> point, but the code never got written.
>Because this is all very complicated and demanding on the
> programmer, I am not making it publicly available, but will be
> glad to send it privately to anyone who really thinks they can
> use it and is willing to get their hands dirty.
>George Reeke (private email: re...@rockefeller.edu)
>
>
>
>
>
>
> On Tue, 2018-04-03 at 23:39 +, Jeff Squyres (jsquyres) wrote:
> > On Apr 2, 2018, at 1:39 PM, dpchoudh .  wrote:
> > >
> > > Sorry for a pedantic follow up:
> > >
> > > Is this (heterogeneous cluster support) something that is specified by
> > > the MPI standard (perhaps as an optional component)?
> >
> > The MPI standard states that if you send a message, you should receive
> the same values at the receiver.  E.g., if you sent int=3, you should
> receive int=3, even if one machine is big endian and the other machine is
> little endian.
> >
> > It does not specify what happens when data sizes are different (e.g., if
> type X is 4 bits on one side and 8 bits on the other) -- there's no good
> answers on what to do there.
> >
> > > Do people know if
> > > MPICH. MVAPICH, Intel MPI etc support it? (I do realize this is an
> > > OpenMPI forum)
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)

2018-02-09 Thread George Bosilca
What are the firewall settings on your 2 nodes?
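
Since your nodes run CentOS 7, something like the following on each node
should show whether firewalld is active and what it currently allows (just
a way to inspect the settings, not a fix by itself):

  systemctl status firewalld
  firewall-cmd --list-all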

  George.



On Fri, Feb 9, 2018 at 3:08 PM, William Mitchell  wrote:

> When I try to run an MPI program on a network with a shared file system
> and connected by ethernet, I get the error message "tcp_peer_send_blocking:
> send() to socket 9 failed: Broken pipe (32)" followed by some suggestions
> of what could cause it, none of which are my problem.  I have searched the
> FAQ, mailing list archives, and googled the error message, with only a few
> hits touching on it, none of which solved the problem.
>
> This is on a Linux CentOS 7 system with Open MPI 1.10.6 and Intel Fortran
> (more detailed system information below).
>
> Here are details on how I encounter the problem:
>
> me@host1> cat hellompi.f90
>program hello
>include 'mpif.h'
>integer rank, size, ierror, nl
>character(len=MPI_MAX_PROCESSOR_NAME) :: hostname
>
>call MPI_INIT(ierror)
>call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
>call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
>call MPI_GET_PROCESSOR_NAME(hostname, nl, ierror)
>print*, 'node', rank, ' of', size, ' on ', hostname(1:nl), ': Hello
> world'
>call MPI_FINALIZE(ierror)
>end
>
> me@host1> mpifort --showme
> ifort -I/usr/include/openmpi-x86_64 -pthread -m64 -I/usr/lib64/openmpi/lib
> -Wl,-rpath -Wl,/usr/lib64/openmpi/lib -Wl,--enable-new-dtags
> -L/usr/lib64/openmpi/lib -lmpi_usempi -lmpi_mpifh -lmpi
>
> me@host1> ifort --version
> ifort (IFORT) 18.0.0 20170811
> Copyright (C) 1985-2017 Intel Corporation.  All rights reserved.
>
> me@host1> mpifort -o hellompi hellompi.f90
>
> [Note: it runs on 1 machine, but not on two]
>
> me@host1> mpirun -np 2 hellompi
>  node   0  of   2  on host1.domain: Hello world
>  node   1  of   2  on host1.domain: Hello world
>
> me@host1> cat hosts
> host2.domain
> host1.domain
>
> me@host1> mpirun -np 2 --hostfile hosts hellompi
> [host2.domain:250313] [[46562,0],1] tcp_peer_send_blocking: send() to
> socket 9 failed: Broken pipe (32)
> --
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> [suggested causes deleted]
>
> Here is system information:
>
> me@host2> cat /etc/redhat-release
> CentOS Linux release 7.4.1708 (Core)
>
> me@host1> uname -a
> Linux host1.domain 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58
> UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> me@host1> rpm -qa | grep openmpi
> mpitests-openmpi-4.1-1.el7.x86_64
> openmpi-1.10.6-2.el7.x86_64
> openmpi-devel-1.10.6-2.el7.x86_64
>
> me@host1> ompi_info --all
> [Results of this command for each host are in the attached files.]
>
> me@host1> ompi_info -v ompi full --parsable
> ompi_info: Error: unknown option "-v"
> [Is the request to run that command given on the Open MPI "Getting Help"
> web page an error?]
>
> me@host1> printenv | grep OMPI
> MPI_COMPILER=openmpi-x86_64
> OMPI_F77=ifort
> OMPI_FC=ifort
> OMPI_MCA_mpi_yield_when_idle=1
> OMPI_MCA_btl=tcp,self
>
> I am using ssh-agent, and I can ssh between the two hosts.  In fact, from
> host1 I can use ssh to request that host2 ssh back to host1:
>
> me@host1> ssh -A host2 "ssh host1 hostname"
> host1.domain
>
> Any suggestions on how to solve this problem are appreciated.
>
> Bill
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] False positives with OpenMPI and memchecker

2018-01-06 Thread George Bosilca
Hi Yvan,

You mention a test. Can you make it available either on the mailing list,
in a GitHub issue, or privately?

  Thanks,
George.



On Sat, Jan 6, 2018 at 7:43 PM,  wrote:

>
> Hello,
>
> I obtain false positives with OpenMPI when memcheck is enabled, using
> OpenMPI 3.0.0
>
> This is similar to an issue I had reported and had been fixed in Nov.
> 2016, but affects MPI_Isend/MPI_Irecv instead of MPI_Send/MPI_Recv.
> I had not done much additional testing on my application using memchecker
> since, so probably may have missed remaining issues at the time.
>
> In the attached test (which has 2 optional variants relating to whether
> the send and receive buffers are allocated on the stack or heap, but
> exhibit the same basic issue), I have (running "mpicc -g vg_ompi_isend_irecv.c
> && mpiexec -n 2 ./a.out"):
>
> ==19651== Memcheck, a memory error detector
> ==19651== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
> ==19651== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright
> info
> ==19651== Command: ./a.out
> ==19651==
> ==19650== Thread 3:
> ==19650== Syscall param epoll_pwait(sigmask) points to unaddressable
> byte(s)
> ==19650==at 0x5470596: epoll_pwait (in /usr/lib/libc-2.26.so)
> ==19650==by 0x5A5A9FA: epoll_dispatch (epoll.c:407)
> ==19650==by 0x5A5EA9A: opal_libevent2022_event_base_loop
> (event.c:1630)
> ==19650==by 0x94C96ED: progress_engine (in /home/yvan/opt/openmpi-3.0/
> lib/openmpi/mca_pmix_pmix2x.so)
> ==19650==by 0x5163089: start_thread (in /usr/lib/libpthread-2.26.so)
> ==19650==by 0x547042E: clone (in /usr/lib/libc-2.26.so)
> ==19650==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
> ==19650==
> ==19651== Thread 3:
> ==19651== Syscall param epoll_pwait(sigmask) points to unaddressable
> byte(s)
> ==19651==at 0x5470596: epoll_pwait (in /usr/lib/libc-2.26.so)
> ==19651==by 0x5A5A9FA: epoll_dispatch (epoll.c:407)
> ==19651==by 0x5A5EA9A: opal_libevent2022_event_base_loop
> (event.c:1630)
> ==19651==by 0x94C96ED: progress_engine (in /home/yvan/opt/openmpi-3.0/
> lib/openmpi/mca_pmix_pmix2x.so)
> ==19651==by 0x5163089: start_thread (in /usr/lib/libpthread-2.26.so)
> ==19651==by 0x547042E: clone (in /usr/lib/libc-2.26.so)
> ==19651==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
> ==19651==
> ==19650== Thread 1:
> ==19650== Invalid read of size 2
> ==19650==at 0x4C33BA0: memmove (in /usr/lib/valgrind/vgpreload_
> memcheck-amd64-linux.so)
> ==19650==by 0x5A27C85: opal_convertor_pack (in
> /home/yvan/opt/openmpi-3.0/lib/libopen-pal.so.40.0.0)
> ==19650==by 0xD177EF1: mca_btl_vader_sendi (in
> /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_btl_vader.so)
> ==19650==by 0xE1A7F31: mca_pml_ob1_send_inline.constprop.4 (in
> /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pml_ob1.so)
> ==19650==by 0xE1A8711: mca_pml_ob1_isend (in
> /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pml_ob1.so)
> ==19650==by 0x4EB4C83: PMPI_Isend (in /home/yvan/opt/openmpi-3.0/
> lib/libmpi.so.40.0.0)
> ==19650==by 0x108B24: main (vg_ompi_isend_irecv.c:63)
> ==19650==  Address 0x1ffefffcc4 is on thread 1's stack
> ==19650==  in frame #6, created by main (vg_ompi_isend_irecv.c:7)
>
> The first 2 warnings seem to relate to initialization, so are not a big
> issue, but the last one occurs whenever I use MPI_Isend, so they are a more
> important issue.
>
> Using a version built without --enable-memchecker, I also have the two
> initialization warnings, but not the warning from MPI_Isend...
>
> Best regards,
>
>   Yvan Fournier
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Parallel MPI broadcasts (parameterized)

2017-12-11 Thread George Bosilca
g to improve).
>
> Now the transmission time and/or the rate I compute for Allgatherv() is
> wrong since even though I can see that it takes pretty much the same time
> as in the case of Bcast() it prints completely unrealistic numbers as the
> Shuffling total time (very high) and rate (very low). I have attached a
> second file which includes the main parts of the code with the Allgatherv()
> idea. The point is that each slave initializes some total time counter
> "time", some transmission time counter "txTime" and some totalsize counter
> in bytes "tolSize" to zero and then iterates through all groups that it
> belongs to and adds the total time and the transmission time it took for
> the send-receive function to complete (the only difference is that I
> subtract the deserialization time from both counters since I don't want
> this counted in order to have a valid comparison with the previous
> implementation). It also adds the total size of data and metadata it
> transmitted to the group. When all slaves are done they return the total
> Shuffle time and the rate in Megabits/sec to the Master. The Master (code
> omitted) just computes the average of these values (time and rate) and
> prints them on the terminal. I am pretty sure that I miss something here
> and I get wrong measurements.
>
> Thanks for your time:)
>
> On Tue, Nov 7, 2017 at 10:37 PM, George Bosilca <bosi...@icl.utk.edu>
> wrote:
>
>> On Tue, Nov 7, 2017 at 6:09 PM, Konstantinos Konstantinidis <
>> kostas1...@gmail.com> wrote:
>>
>>> OK, I will try to explain a few more things about the shuffling and I
>>> have attached only specific excerpts of the code to avoid confusion. I have
>>> added many comments.
>>>
>>> First, let me note that this project is an implementation of the
>>> Terasort benchmark with a master node which assigns jobs to the slaves and
>>> communicates with them after each phase to get measurements.
>>>
>>> The file shuffle_before.cc shows how I am doing the shuffling up to now
>>> and the shuffle_after.cc the progress I made so far switching to
>>> Allgatherv().
>>>
>>> I have also included the code that measures time and data size since
>>> it's crucial for me to check if I have rate speedup.
>>>
>>> Some questions I have are:
>>> 1. At shuffle_after.cc:61 why do we reserve *comm.Get_size() *entries
>>> for* recv_counts* and not *comm.Get_size()-1 *? For example if I am
>>> rank k what is the point of *recv_counts[k-1]*? I guess that rank k
>>> also receives data from himself but we can ignore it, right?
>>>
>>
>> No, you cant simply ignore it ;) allgather copies the same amount of data
>> to all processes in the communicator ... including itself. If you want to
>> argue about this reach out to the MPI standardization body ;)
>>
>>
>>>
>>> 2. My next concern is about the structure of the buffer *recv_buf[]*.
>>> The documentation says that the data is stored there ordered. So I assume
>>> that it's stored as segments of char* ordered by rank and the way to
>>> distinguish them is to chop the whole data based on *recv_counts[]*. So
>>> let G = {g1, g2, ..., gN} a group that exchanges data. Let's take slave g2:
>>> Then segment *recv_buf[0 until **recv_counts[0]-1**] *is what g2
>>> received from g1, *recv_buf[**recv_counts[0] until **recv_counts[1]-1**]
>>> *is what g2 received from himself (ignore it), and so on... Is this
>>> idea correct?
>>>
>>
>> I don't know what documentation says "ordered", there is no such wording
>> in the MPI standard. By carefully playing with the receive datatype I can
>> do anything I want, including interleaving data from the different peers.
>> But this is not what you are trying to do here.
>>
>> The placement in memory you describe is true if you use the displacement
>> array as crafted in my example. The entry i in the displacement array
>> specifies the displacement (relative to recvbuf) at which to place the
>> incoming data from process i, so where you receive data has nothing to do
>> with the amount you receive but with what you have in the displacement
>> array.
>>
>>
>>>
>>> So I have written a sketch of the code at shuffle_after.cc which I also
>>> try to explain how the master will compute rate, but at least I want to
>>> make it work.
>>>
>>
>> This code looks OK to me. I would however:
>>
>> 1. Remove the barriers on the workerComm. If the order of the
>> communicators 

Re: [OMPI users] Parallel MPI broadcasts (parameterized)

2017-11-07 Thread George Bosilca
On Tue, Nov 7, 2017 at 6:09 PM, Konstantinos Konstantinidis <
kostas1...@gmail.com> wrote:

> OK, I will try to explain a few more things about the shuffling and I have
> attached only specific excerpts of the code to avoid confusion. I have
> added many comments.
>
> First, let me note that this project is an implementation of the Terasort
> benchmark with a master node which assigns jobs to the slaves and
> communicates with them after each phase to get measurements.
>
> The file shuffle_before.cc shows how I am doing the shuffling up to now
> and the shuffle_after.cc the progress I made so far switching to
> Allgatherv().
>
> I have also included the code that measures time and data size since it's
> crucial for me to check if I have rate speedup.
>
> Some questions I have are:
> 1. At shuffle_after.cc:61 why do we reserve *comm.Get_size() *entries for*
> recv_counts* and not *comm.Get_size()-1 *? For example if I am rank k
> what is the point of *recv_counts[k-1]*? I guess that rank k also
> receives data from himself but we can ignore it, right?
>

No, you can't simply ignore it ;) allgather copies the same amount of data
to all processes in the communicator ... including itself. If you want to
argue about this, reach out to the MPI standardization body ;)


>
> 2. My next concern is about the structure of the buffer *recv_buf[]*. The
> documentation says that the data is stored there ordered. So I assume that
> it's stored as segments of char* ordered by rank and the way to distinguish
> them is to chop the whole data based on *recv_counts[]*. So let G = {g1,
> g2, ..., gN} a group that exchanges data. Let's take slave g2: Then segment 
> *recv_buf[0
> until **recv_counts[0]-1**] *is what g2 received from g1, 
> *recv_buf[**recv_counts[0]
> until **recv_counts[1]-1**] *is what g2 received from himself (ignore
> it), and so on... Is this idea correct?
>

I don't know what documentation says "ordered"; there is no such wording in
the MPI standard. By carefully playing with the receive datatype I can do
anything I want, including interleaving data from the different peers. But
this is not what you are trying to do here.

The placement in memory you describe is true if you use the displacement
array as crafted in my example. The entry i in the displacement array
specifies the displacement (relative to recvbuf) at which to place the
incoming data from process i, so where you receive data has nothing to do
with the amount you receive but with what you have in the displacement
array.
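
As a minimal sketch (reusing recv_buf, recv_counts and displs from my
earlier example; comm_size and my_rank are the size and rank on the same
communicator, and deserialize_chunk is a hypothetical helper of yours):

  for (int i = 0; i < comm_size; i++) {
      if (i == my_rank) continue;                 /* skip my own contribution */
      unsigned char *from_i = (unsigned char *)recv_buf + displs[i];
      deserialize_chunk(from_i, recv_counts[i]);  /* recv_counts[i] bytes came from rank i */
  }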


>
> So I have written a sketch of the code at shuffle_after.cc which I also
> try to explain how the master will compute rate, but at least I want to
> make it work.
>

This code looks OK to me. I would however:

1. Remove the barriers on the workerComm. If the order of the communicators
in the multicastGroupMap is identical on all processes (including
communicators they do not belong to) then the barriers are superfluous.
However, if you are trying to protect your processes from starting the
allgather collective too early, you can replace the barrier on workerComm
with one on mcComm.

2. The check "ns.find(rank) != ns.end()" should be equivalent to "mcComm ==
MPI_COMM_NULL" if I understand your code correctly.

3. This is an optimization. Remove all time exchanges outside the main
loop. Instead of sending them one-by-one, keep them in an array and send
the entire array once per CodedWorker::execShuffle, possibly via an
MPI_Allgatherv toward the master process in MPI_COMM_WORLD (in this case
you can convert the "long long" into a double to facilitate the
collective); see the sketch below.
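
A minimal sketch of that last point, using a plain MPI_Gather since only
the master needs the values (world_rank, world_size and txTime are assumed
from your code):

  double my_time = (double)txTime;   /* convert the "long long" counter */
  double *all_times = NULL;
  if (world_rank == 0)
      all_times = (double *)malloc(world_size * sizeof(double));
  MPI_Gather(&my_time, 1, MPI_DOUBLE, all_times, 1, MPI_DOUBLE, 0,
             MPI_COMM_WORLD);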

  George.



>
> I know that this discussion is getting long but if you have some free time
> can you take a look at it?
>
> Thanks,
> Kostas
>
>
> On Tue, Nov 7, 2017 at 9:34 AM, George Bosilca <bosi...@icl.utk.edu>
> wrote:
>
>> If each process send a different amount of data, then the operation
>> should be an allgatherv. This also requires that you know the amount each
>> process will send, so you will need a allgather. Schematically the code
>> should look like the following:
>>
>> long bytes_send_count = endata.size * sizeof(long);  // compute the
>> amount of data sent by this process
>> long* recv_counts = (long*)malloc(comm_size * sizeof(long));  // allocate
>> buffer to receive the amounts from all peers
>> int* displs = (int*)malloc(comm_size * sizeof(int));  // allocate buffer
>> to compute the displacements for each peer
>> MPI_Allgather( &bytes_send_count, 1, MPI_LONG, recv_counts, 1, MPI_LONG,
>> comm);  // exchange the amount of sent data
>> long total = 0;  // we need a total amount of data to be received
>> for( int i = 0; i < comm_size; i++) {
>> d

Re: [OMPI users] Parallel MPI broadcasts (parameterized)

2017-11-07 Thread George Bosilca
If each process sends a different amount of data, then the operation should
be an allgatherv. This also requires that you know the amount each process
will send, so you will need an allgather. Schematically the code should look
like the following:

long bytes_send_count = endata.size * sizeof(long);  // compute the amount
of data sent by this process
long* recv_counts = (long*)malloc(comm_size * sizeof(long));  // allocate
buffer to receive the amounts from all peers
int* displs = (int*)malloc(comm_size * sizeof(int));  // allocate buffer to
compute the displacements for each peer
MPI_Allgather( &bytes_send_count, 1, MPI_LONG, recv_counts, 1, MPI_LONG,
comm);  // exchange the amount of sent data
long total = 0;  // we need a total amount of data to be received
for( int i = 0; i < comm_size; i++) {
displs[i] = total;  // update the displacements
total += recv_counts[i];   // and the total count
}
char* recv_buf = (char*)malloc(total * sizeof(char));  // prepare buffer
for the allgatherv
MPI_Allgatherv( &(endata.data), endata.size*sizeof(char),
MPI_UNSIGNED_CHAR, recv_buf, recv_counts, displs, MPI_UNSIGNED_CHAR, comm);

George.



On Tue, Nov 7, 2017 at 4:23 AM, Konstantinos Konstantinidis <
kostas1...@gmail.com> wrote:

> OK, I started implementing the above Allgather() idea without success
> (segmentation fault). So I will post the problematic lines hare:
>
> * comm.Allgather(&(endata.size), 1, MPI::UNSIGNED_LONG_LONG,
> &(endata_rcv.size), 1, MPI::UNSIGNED_LONG_LONG);*
> * endata_rcv.data = new unsigned char[endata_rcv.size*lineSize];*
> * comm.Allgather(&(endata.data), endata.size*lineSize, MPI::UNSIGNED_CHAR,
> &(endata_rcv.data), endata_rcv.size*lineSize, MPI::UNSIGNED_CHAR);*
> * delete [] endata.data;*
>
> The idea (as it was also for the broadcasts) is first to transmit the data
> size as an unsigned long long integer, so that the receivers will reserve
> the required memory for the actual data to be transmitted after that. To my
> understanding, the problem is that each broadcasted data, let D(s,G), as I
> explained in the previous email is not only different but also has
> different size (in general). That's because if I replace the 3rd line with
>
> * comm.Allgather(&(endata.data), 1, MPI::UNSIGNED_CHAR,
> &(endata_rcv.data), 1, MPI::UNSIGNED_CHAR);*
>
> seems to work without seg. fault but this is pointless for me since I
> don't want only 1 char to be transmitted. So if we see the previous image I
> posted, imagine that the red, green and blue squares are different in size?
> Can Allgather() even work then? If no, do you suggest anything else or I am
> trapped in using the MPI_Bcast() as shown in Option 1?
>
> On Mon, Nov 6, 2017 at 8:58 AM, George Bosilca <bosi...@icl.utk.edu>
> wrote:
>
>> On Sun, Nov 5, 2017 at 10:23 PM, Konstantinos Konstantinidis <
>> kostas1...@gmail.com> wrote:
>>
>>> Hi George,
>>>
>>> First, let me note that the cost of q^(k-1)]*(q-1) communicators was
>>> fine for the values of parameters q,k I am working with. Also, the whole
>>> point of speeding up the shuffling phase is trying to reduce this number
>>> even more (compared to already known implementations) which is a major
>>> concern of my project. But thanks for pointing that out. Btw, do you know
>>> what is the maximum such number in MPI?
>>>
>>
>> Last time I run into such troubles these limits were: 2k for MVAPICH, 16k
>> for MPICH and 2^30-1 for OMPI (all positive signed 23 bits integers). It
>> might have changed meanwhile.
>>
>>
>>> Now to the main part of the question, let me clarify that I have 1
>>> process per machine. I don't know if this is important here but my way of
>>> thinking is that we have a big text file and each process will have to work
>>> on some chunks of it (like chapters of a book). But each process resides in
>>> an machine with some RAM which is able to handle a specific amount of work
>>> so if you generate many processes per machine you must have fewer book
>>> chapters per process than before. Thus, I wanted to avoid thinking in the
>>> process-level rather than machine-level with the RAM limitations.
>>>
>>> Now to the actual shuffling, here is what I am currently doing (Option
>>> 1):
>>>
>>> Let's denote the data that slave s has to send to the slaves in group G
>>> as D(s,G).
>>>
>>> *for each slave s in 1,2,...,K{*
>>>
>>> *for each group G that s participates into{*
>>>
>>> *if (my rank is s){*
>>> *MPI_Bcast(send data D(s,G))*
>>> *}else if(my rank is in group G)*
>>> *  

Re: [OMPI users] Parallel MPI broadcasts (parameterized)

2017-11-06 Thread George Bosilca
llective communications you will be hammering the network in a
significant way (so you will have to take into account the network
congestion); and 3) all processes need to have all memory for receive
allocated for the buffers. Thus, even by implementing a nice communication
scheme you might encounter some performance issues.

Another way to do this is, instead of conglomerating all communications at
a single point in time, to spread them out across time by imposing your own
communication logic. This basically translates a set of blocking collectives
(bcast is a perfect target) into a pipelined mix. Instead of describing such
a scheme here, I suggest you read the algorithmic description of the SUMMA
and/or PUMMA distributed matrix multiplication algorithms.
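
As a rough illustration of the pipelining idea only (not of SUMMA itself),
a large broadcast can be cut into fixed-size pieces so that the traffic of
different groups has a chance to interleave in time (buf, total_bytes, root
and group_comm are placeholders):

  const size_t CHUNK = 1 << 20;   /* 1 MiB per piece, an arbitrary choice */
  for (size_t off = 0; off < total_bytes; off += CHUNK) {
      int len = (int)((total_bytes - off < CHUNK) ? (total_bytes - off) : CHUNK);
      MPI_Bcast(buf + off, len, MPI_UNSIGNED_CHAR, root, group_comm);
  }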

  George.


I am not sure whether this makes sense since I am confused about the
> correspodence of the data transmitted with Allgather() compared to the
> notation D(s,G) I am currently using.
>
> Thanks.
>
>
> On Tue, Oct 31, 2017 at 11:11 PM, George Bosilca <bosi...@icl.utk.edu>
> wrote:
>
>> It really depends what are you trying to achieve. If the question is
>> rhetorical: "can I write a code that does in parallel broadcasts on
>> independent groups of processes ?" then the answer is yes, this is
>> certainly possible. If however you add a hint of practicality in your
>> question "can I write an efficient parallel broadcast between independent
>> groups of processes?" then I'm afraid the answer will be a negative one.
>>
>> Let's not look at how you can write the multiple bcast code as the answer
>> in the stackoverflow is correct, but instead look at what resources these
>> collective operations are using. In general you can assume that nodes are
>> connected by a network, able to move data at a rate B in both directions
>> (full duplex). Assuming the implementation of the bcast algorithm is not
>> entirely moronic, the bcast can saturate the network with a single process
>> per node. Now, if you have multiple processes per node (P) then either you
>> schedule them sequentially (so that each one has the full bandwidth B) or
>> you let them progress in parallel in which case each participating process
>> can claim a lower bandwidth B/P (as it is shared between all processes on
>> the nore).
>>
>> So even if you are able to expose enough parallelism, physical resources
>> will impose the real hard limit.
>>
>> That being said I have the impression you are trying to implement an
>> MPI_Allgather(v) using a series of MPI_Bcast. Is that true ?
>>
>>   George.
>>
>> PS: Few other constraints: the cost of creating the q^(k-1)]*(q-1)
>> communicator might be prohibitive; the MPI library might support a limited
>> number of communicators.
>>
>>
>> On Tue, Oct 31, 2017 at 11:42 PM, Konstantinos Konstantinidis <
>> kostas1...@gmail.com> wrote:
>>
>>> Assume that we have K=q*k nodes (slaves) where q,k are positive integers
>>> >= 2.
>>>
>>> Based on the scheme that I am currently using I create [q^(k-1)]*(q-1)
>>> groups (along with their communicators). Each group consists of k nodes and
>>> within each group exactly k broadcasts take place (each node broadcasts
>>> something to the rest of them). So in total [q^(k-1)]*(q-1)*k MPI
>>> broadcasts take place. Let me skip the details of the above scheme.
>>>
>>> Now theoretically I figured out that there are q-1 groups that can
>>> communicate in parallel at the same time i.e. groups that have no common
>>> nodes and I would like to utilize that to speedup the shuffling. I have
>>> seen here https://stackoverflow.com/questions/11372012/mpi-severa
>>> l-broadcast-at-the-same-time that this is possible in MPI.
>>>
>>> In my case it's more complicated since q,k are parameters of the problem
>>> and change between different experiments. If I get the idea about the 2nd
>>> method that is proposed there and assume that we have only 3 groups within
>>> which some communication takes places one can simply do:
>>>
>>> *if my rank belongs to group 1{*
>>> *comm1.Bcast(..., ..., ..., rootId);*
>>> *}else if my rank belongs to group 2{*
>>> *comm2.Bcast(..., ..., ..., rootId);*
>>> *}else if my rank belongs to group3{*
>>> *comm3.Bcast(..., ..., ..., rootId);*
>>> *} *
>>>
>>> where comm1, comm2, comm3 are the corresponding sub-communicators that
>>> contain only the members of each group.
>>>
>>> But how can I generalize the above idea to arbitrary number of groups or
>>> perhaps do something else?
>>>
>>> The code is in C++ and the MPI installed is described in the attached
>>> file.
>>>
>>> Regards,
>>> Kostas
>>>
>>>
>>> ___
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/users
>>>
>>
>>
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Parallel MPI broadcasts (parameterized)

2017-10-31 Thread George Bosilca
It really depends on what you are trying to achieve. If the question is
rhetorical: "can I write a code that does in parallel broadcasts on
independent groups of processes ?" then the answer is yes, this is
certainly possible. If however you add a hint of practicality in your
question "can I write an efficient parallel broadcast between independent
groups of processes?" then I'm afraid the answer will be a negative one.

Let's not look at how you can write the multiple bcast code as the answer
in the stackoverflow is correct, but instead look at what resources these
collective operations are using. In general you can assume that nodes are
connected by a network, able to move data at a rate B in both directions
(full duplex). Assuming the implementation of the bcast algorithm is not
entirely moronic, the bcast can saturate the network with a single process
per node. Now, if you have multiple processes per node (P) then either you
schedule them sequentially (so that each one has the full bandwidth B) or
you let them progress in parallel in which case each participating process
can claim a lower bandwidth B/P (as it is shared between all processes on
the node).

So even if you are able to expose enough parallelism, physical resources
will impose the real hard limit.

That being said, I have the impression you are trying to implement an
MPI_Allgather(v) using a series of MPI_Bcast. Is that true?

  George.

PS: A few other constraints: the cost of creating the [q^(k-1)]*(q-1)
communicators might be prohibitive, and the MPI library might support only
a limited number of communicators.


On Tue, Oct 31, 2017 at 11:42 PM, Konstantinos Konstantinidis <
kostas1...@gmail.com> wrote:

> Assume that we have K=q*k nodes (slaves) where q,k are positive integers
> >= 2.
>
> Based on the scheme that I am currently using I create [q^(k-1)]*(q-1)
> groups (along with their communicators). Each group consists of k nodes and
> within each group exactly k broadcasts take place (each node broadcasts
> something to the rest of them). So in total [q^(k-1)]*(q-1)*k MPI
> broadcasts take place. Let me skip the details of the above scheme.
>
> Now theoretically I figured out that there are q-1 groups that can
> communicate in parallel at the same time i.e. groups that have no common
> nodes and I would like to utilize that to speedup the shuffling. I have
> seen here https://stackoverflow.com/questions/11372012/mpi-
> several-broadcast-at-the-same-time that this is possible in MPI.
>
> In my case it's more complicated since q,k are parameters of the problem
> and change between different experiments. If I get the idea about the 2nd
> method that is proposed there and assume that we have only 3 groups within
> which some communication takes places one can simply do:
>
> *if my rank belongs to group 1{*
> *comm1.Bcast(..., ..., ..., rootId);*
> *}else if my rank belongs to group 2{*
> *comm2.Bcast(..., ..., ..., rootId);*
> *}else if my rank belongs to group3{*
> *comm3.Bcast(..., ..., ..., rootId);*
> *} *
>
> where comm1, comm2, comm3 are the corresponding sub-communicators that
> contain only the members of each group.
>
> But how can I generalize the above idea to arbitrary number of groups or
> perhaps do something else?
>
> The code is in C++ and the MPI installed is described in the attached file.
>
> Regards,
> Kostas
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Force TCP to go through network adapter for single-machine test.

2017-10-30 Thread George Bosilca
John,

To disable shared memory (sm or vader in Open MPI depending on the
version), you have to remove it from the list of approved underlying
network devices. As you specifically want TCP support, I would add "--mca
btl tcp,self" to my mpirun command line. However, by default Open MPI tries
to avoid using the loopback device for communication between processes on
the same node, so you will have to manually allow the loopback by adding
"--mca btl_tcp_if_include 127.0.0.1/24" to your command line.

More info can be found https://www.open-mpi.org/faq/?category=tcp

  George.



On Mon, Oct 30, 2017 at 1:18 PM, John Moore  wrote:

> I would like to test the network latency/bandwidth of each node that I am
> running on in parallel. I think the simplest way to do this would be to
> have each node test itself.
>
> My question is: How can I force all the OpenMPI TCP communication to go
> through the network adapter, and not use the optimized node-local
> communication?
>
> Any advice would be greatly appreciated.
>
> Best Regards,
> John
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread George Bosilca
John,

In the ULFM mailing list thread you pointed out, we converged toward a hardware
issue. Resources associated with the dead process were not correctly freed,
and follow-up processes on the same setup would inherit issues related to
these lingering messages. However, keep in mind that the setup was
different as we were talking about losing a process.

The solution proposed there, forcing the timeout to a large value, did not
fix the problem; it just delayed it enough for the application to run to
completion.

  George.


On Thu, Sep 28, 2017 at 5:17 AM, John Hearns via users <
users@lists.open-mpi.org> wrote:

> ps. Before you do the reboot of a compute node, have you run 'ibdiagnet' ?
>
> On 28 September 2017 at 11:17, John Hearns  wrote:
>
>>
>> Google turns this up:
>> https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls
>>
>>
>> On 28 September 2017 at 01:26, Ludovic Raess 
>> wrote:
>>
>>> Hi,
>>>
>>>
>>> we have a issue on our 32 nodes Linux cluster regarding the usage of
>>> Open MPI in a Infiniband dual-rail configuration (2 IB Connect X FDR
>>> single port HCA, Centos 6.6, OFED 3.1, openmpi 2.0.0, gcc 5.4, cuda 7).
>>>
>>>
>>> On long runs (over ~10 days) involving more than 1 node (usually 64 MPI
>>> processes distributed on 16 nodes [node01-node16]​), we observe the freeze
>>> of the simulation due to an internal error displaying: "error polling LP CQ
>>> with status REMOTE ACCESS ERROR status number 10 for wr_id e88c00 opcode 1
>>>  vendor error 136 qp_idx 0" (see attached file for full output).
>>>
>>>
>>> The job hangs, no computation neither communication occurs anymore, but
>>> no exit neither unload of the nodes is observed. The job can be killed
>>> normally but then the concerned nodes do not fully recover. A relaunch of
>>> the simulation usually sustains a couple of iterations (few minutes
>>> runtime), and then the job hangs again due to similar reasons. The only
>>> workaround so far is to reboot the involved nodes.
>>>
>>>
>>> Since we didn't find any hints on the web regarding this
>>> strange behaviour, I am wondering if this is a known issue. We actually
>>> don't know what causes this to happen and why. So any hints were to start
>>> investigating or possible reasons for this to happen are welcome.​
>>>
>>>
>>> Ludovic
>>>
>>> ___
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/users
>>>
>>
>>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI v3.0 on Cygwin

2017-09-27 Thread George Bosilca
On Thu, Sep 28, 2017 at 12:45 AM, Fab Tillier via users <
users@lists.open-mpi.org> wrote:

> Hi Llelan,
>
> Llelan D. wrote on Wed, 27 Sep 2017 at 19:06:23
>
> > On 09/27/2017 3:04 PM, Jeffrey A Cummings wrote:
> >> The MS-MPI developers disagree with your statement below and claim to
> >> be actively working on MPI-3.0 compliance, with an expected new version
> >> release about every six month.
> >
> > They can disagree as much as they want. I've spent over 30 years doing
> > contracts for and associated with MS and am very familiar with their
> policy of
> > what they claim vs. what they do. Check out:
> >http://mpi-forum.org/slides/2014/11/mpi3-impl-status-Nov14.pdf
> > for where msmpi was in 2014 when they were claiming
> > the same things. The latest version of msmpi (v8.1.12438) still only
> > provides minimal support for MPI specification v2.0.
>
> Did you mean MPI 3.0 here?  We've had comprehensive support for MPI 2.0
> for quite some time, and as of version 8.1 believe we should be MPI 2.2
> compliant as well as supporting a fair bit of MPI 3.0.  We've been steadily
> adding MPI 3 support each release, and tend to prioritize development based
> on user feedback, so if there are APIs you need, please let us know - you
> can contact the MS-MPI team directly at mailto:ask...@microsoft.com.
>
> You should be able to see the evolution through the MPICH BOF slides from
> past super computing conferences.  The 2015 slides are here:
> https://www.mpich.org/files/2015/11/SC15-MPICH-BoF.pdf, but unfortunately
> the 2016 slides link is broken.
>

The link to the slides presenting the level of support of the MPI
specification by different MPI implementations can be found on the main
page of the MPI Forum at http://mpi-forum.org/ (look for Implementation
Status).

  George.


>
> > Understand, I'm no MS basher; Windows is still the most likely
> development
> > environment in the industry and must be respected. This is why I always
> > argue that it is a mistake not to distribute native MS versions of
> packages no
> > matter what level of popular support there is.
> > Allowing MS to restrict the level of support on the Windows platform to
> only
> > the avenues they wish developers to use is a huge restriction for the
> > evolution of a specification and a terrible problem for those of us who
> must
> > work cross-platform.
>
> I think you misconstrue our objectives here.  If there are APIs you would
> like us to support, let us know so that we can prioritize their
> implementation.  We also don't prevent or discourage alternative
> implementations, and while we would be more than happy to see Windows
> native Open MPI releases, we understand and respect the Open MPI
> developers' decision to invest their time elsewhere.
>
> Cheers,
> -Fab
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Multi-threaded MPI communication

2017-09-21 Thread George Bosilca
All your processes send their data to a single destination at the same time.
Clearly you are reaching the capacity of your network and your data
transfers will be bound by this. This is a physical constraint that you can
only overcome by adding network capacity to your cluster.

At the software level the only possibility is to make each of the p slave
processes send its data to your centralized resource at a different time,
so that the data has time to be transferred through the network before the
next slave is ready to submit its result.

  George




On Thu, Sep 21, 2017 at 4:57 PM, saiyedul islam 
wrote:

> Hi all,
>
> I am working on parallelization of a Data Clustering Algorithm in which I
> am following MPMD pattern of MPI (i.e. 1 master process and p slave
> processes in same communicator). It is an iterative algorithm where 2 loops
> inside iteration are separately parallelized.
>
> The first loop is parallelized by partitioning the N size input data into
> (almost) equal parts between p slaves. Each slave produces a contiguous
> chunk of about (p * N/p) double values as result of its local processing.
> This local chunk from each slave is collected back on master process where
> it is merged with chunks from other slaves.
> If a blocking call (MPI_Send / Recv) is put in a loop on master such that
> it receives the data one by one in order of their rank from slaves, then
> each slave takes about 75 seconds for its local computation (as calculated
> by MPI_Wtime() ) and about 1.5 seconds for transferring its chunk to
> master. But, as the transfer happens in order, by the time last slave
> process is done, the total time becomes 75 seconds for computation and 50
> seconds for communication.
> These timings are for a cluster of 31 machines where a single process
> executes in each machine. All these machines are connected directly via a
> private Gigabit network switch. In order to be effectively parallelize the
> algorithm, the overall execution time needs to come below 80 seconds.
>
> I have tried following strategies to solve this problem:
> 0. Ordered transfer, as explained above.
> 1. Collecting data through MPI_Gatherv and assuming that internally it
> will transfer data in parallel.
> 2. Creating p threads at master using OpenMP and calling MPI_Recv (or
> MPI_Irecv with MPI_Wait) by threads. The received data by each process is
> put in a separate buffer. My installation support MPI_THREAD_MULTIPLE.
>
> The problem is that strategies 1 & 2 are taking almost similar time as
> compared to strategy 0.
> *Is there a way through which I can receive data in parallel and
> substantially decrease the overall execution time?*
>
> Hoping to get your help soon. Sorry for the long question.
>
> Regards,
> Saiyedul Islam
>
> PS: Specifications of the cluster: GCC 5.10, OpenMP 2.0.1, CentOS 6.5 (as
> part of Rockscluster).
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Groups and Communicators

2017-08-02 Thread George Bosilca
Diego,

Setting the color to MPI_COMM_NULL is not good, as it results in some
random value (and not in MPI_UNDEFINED, which is the value that does not
generate a communicator). Change the color to MPI_UNDEFINED and your
application should work just fine (in the sense that all processes not in
the master communicator will have the master_comm variable set to
MPI_COMM_NULL).
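
A minimal sketch of the pattern in C (is_master stands for whatever test
selects your masters, and rank is the rank in MPI_COMM_WORLD):

  int color = is_master ? 0 : MPI_UNDEFINED;
  MPI_Comm master_comm;
  MPI_Comm_split(MPI_COMM_WORLD, color, rank, &master_comm);
  if (master_comm != MPI_COMM_NULL) {
      MPI_Comm_rank(master_comm, &master_rank);   /* only the masters get here */
  }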

  George.



On Wed, Aug 2, 2017 at 10:15 AM, Diego Avesani 
wrote:

> Dear Jeff, Dear all,
>
> thanks, I will try immediately.
>
> thanks again
>
>
>
> Diego
>
>
> On 2 August 2017 at 14:01, Jeff Squyres (jsquyres) 
> wrote:
>
>> Just like in your original code snippet, you can
>>
>> If (master_comm .ne. Mpi_comm_null) then
>>...
>>
>> Sent from my phone. No type good.
>>
>> On Aug 2, 2017, at 7:17 AM, Diego Avesani 
>> wrote:
>>
>> Dear all, Dear Jeff,
>>
>> I am very sorry, but I do not know how to do this kind of comparison.
>>
>> this is my peace of code:
>>
>> CALL MPI_GROUP_INCL(GROUP_WORLD, nPSObranch, MRANKS, MASTER_GROUP,ierr)
>>  CALL MPI_COMM_CREATE_GROUP(MPI_COMM_WORLD,MASTER_GROUP,0,MASTER_
>> COMM,iErr)
>>  !
>>  IF(MPI_COMM_NULL .NE. MASTER_COMM)THEN
>> CALL MPI_COMM_RANK(MASTER_COMM, MPImaster%rank,MPIlocal%iErr)
>> CALL MPI_COMM_SIZE(MASTER_COMM, MPImaster%nCPU,MPIlocal%iErr)
>>  ELSE
>> MPImaster%rank = MPI_PROC_NULL
>>  ENDIF
>>
>> and then
>>
>>  IF(MPImaster%rank.GE.0)THEN
>> CALL MPI_SCATTER(PP, 10, MPI_DOUBLE, PPL, 10,MPI_DOUBLE, 0,
>> MASTER_COMM, iErr)
>>  ENDIF
>>
>> What I should compare?
>>
>> Thanks again
>>
>> Diego
>>
>>
>> On 1 August 2017 at 16:18, Jeff Squyres (jsquyres) 
>> wrote:
>>
>>> On Aug 1, 2017, at 5:56 AM, Diego Avesani 
>>> wrote:
>>> >
>>> > If I do this:
>>> >
>>> > CALL MPI_SCATTER(PP, npart, MPI_DOUBLE, PPL, 10,MPI_DOUBLE, 0,
>>> MASTER_COMM, iErr)
>>> >
>>> > I get an error. This because some CPU does not belong to MATER_COMM.
>>> The alternative should be:
>>> >
>>> > IF(rank.LT.0)THEN
>>> > CALL MPI_SCATTER(PP, npart, MPI_DOUBLE, PPL, 10,MPI_DOUBLE, 0,
>>> MASTER_COMM, iErr)
>>> > ENDIF
>>>
>>> MPI_PROC_NULL is a sentinel value; I don't think you can make any
>>> assumptions about its value (i.e., that it's negative).  In practice, it
>>> probably always is, but if you want to check the rank, you should compare
>>> it to MPI_PROC_NULL.
>>>
>>> That being said, comparing MASTER_COMM to MPI_COMM_NULL is no more
>>> expensive than comparing an integer. So that might be a bit more expressive
>>> to read / easier to maintain over time, and it won't cost you any
>>> performance.
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>>

Re: [OMPI users] Groups and Communicators

2017-07-28 Thread George Bosilca
I guess the second comm_rank call is invalid on all non-leader processes,
as their LEADER_COMM communicator is MPI_COMM_NULL.

george

On Fri, Jul 28, 2017 at 05:06 Diego Avesani <diego.aves...@gmail.com> wrote:

> Dear George, Dear all,
>
> thanks, thanks a lot. I will tell you everything.
> I will try also to implement your suggestion.
>
> Unfortunately, the program that I have shown to you is not working. I get
> the following error:
>
> [] *** An error occurred in MPI_Comm_rank
> [] *** reported by process [643497985,7]
> [] *** on communicator MPI_COMM_WORLD
> [] *** MPI_ERR_COMM: invalid communicator
> [] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [] ***and potentially your MPI job)
> [warn] Epoll ADD(4) on fd 47 failed.  Old events were 0; read change was 0
> (none); write change was 1 (add): Bad file descriptor
> [warn] Epoll ADD(4) on fd 65 failed.  Old events were 0; read change was 0
> (none); write change was 1 (add): Bad file descriptor
> [] 8 more processes have sent help message help-mpi-errors.txt /
> mpi_errors_are_fatal
> [] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help /
> error messages
>
> What do you think could be the error?
>
> Really, Really thanks again
>
>
>
>
>
> Diego
>
>
> On 27 July 2017 at 15:57, George Bosilca <bosi...@icl.utk.edu> wrote:
>
>> This looks good. If performance is critical you can speed up the entire
>> process by using MPI_Comm_create_group instead of the second
>> MPI_COMM_SPLIT. The MPI_Comm_create_group is collective only over the
>> resulting communicator and not over the source communicator, so its cost is
>> only dependent on the number of groups and not on the total number of
>> processes.
>>
>> You can also try to replace the first MPI_COMM_SPLIT by the same
>> approach. I would be curious to see the outcome.
>>
>>   George.
>>
>>
>> On Thu, Jul 27, 2017 at 9:44 AM, Diego Avesani <diego.aves...@gmail.com>
>> wrote:
>>
>>> Dear George, Dear all,
>>>
>>> I have tried to create a simple example. In particular, I would like to
>>> use 16 CPUs and to create four groups according to rank, and then a
>>> communicator between the masters of each group. I have tried to follow the first
>>> part of this example
>>> <http://mpitutorial.com/tutorials/introduction-to-groups-and-communicators/>.
>>> In the last part I have tried to create a communicator between masters, as
>>> suggested by George.
>>>
>>> Here my example:
>>>
>>>  CALL MPI_INIT(ierror)
>>>  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nCPU, ierror)
>>>  CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
>>>  !
>>>  colorloc = rank/4
>>>  !
>>>  CALL MPI_COMM_SPLIT(MPI_COMM_WORLD,colorloc,rank,*NEW_COMM*,ierror)
>>>  CALL MPI_COMM_RANK(*NEW_COMM*, subrank,ierror);
>>>  CALL MPI_COMM_SIZE(*NEW_COMM*, subnCPU,ierror);
>>>  !
>>>  IF(MOD(rank,4).EQ.0)THEN
>>> *! where I set color for the masters*
>>> colorglobal = MOD(rank,4)
>>>  ELSE
>>> colorglobal = MPI_UNDEFINED
>>>  ENDIF
>>>  !
>>>  CALL MPI_COMM_SPLIT(MPI_COMM_WORLD,colorglobal,rank,LEADER_COMM,ierror)
>>>  CALL MPI_COMM_RANK(*LEADER_COMM*, leader_rank,ierror);
>>>  CALL MPI_FINALIZE(ierror)
>>>
>>> I would like to know if this could be correct. I mean if I have
>>> understood correctly what George told me about the code design. Now, this
>>> example does not work, but probably there is some coding error.
>>>
>>> Really, Really thanks
>>> Diego
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Diego
>>>
>>>
>>> On 27 July 2017 at 10:42, Diego Avesani <diego.aves...@gmail.com> wrote:
>>>
>>>> Dear George, Dear all,
>>>>
>>>> A question regarding program design:
>>>> The draft that I have sent to you has to be executed many, many times.
>>>> Does the splitting procedure ensure efficiency?
>>>>
>>>> I will try, at least, to create groups and split them. I am a beginner
>>>> in the MPI groups environment.
>>>> really, really thanks.
>>>>
>>>> You are my lifesaver.
>>>>
>>>>
>>>>
>>>> Diego
>>>>
>>>>
>>>> On 26 July 2017 at 15:09, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>
>>>>>

Re: [OMPI users] Groups and Communicators

2017-07-27 Thread George Bosilca
This looks good. If performance is critical you can speed up the entire
process by using MPI_Comm_create_group instead of the second
MPI_COMM_SPLIT. The MPI_Comm_create_group is collective only over the
resulting communicator and not over the source communicator, so its cost is
only dependent on the number of groups and not on the total number of
processes.

You can also try to replace the first MPI_COMM_SPLIT by the same approach.
I would be curious to see the outcome.
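
For the leaders-only communicator, a rough C sketch of the
MPI_Comm_create_group variant (it assumes 16 ranks split into groups of 4, as
in your example) would be:

int world_rank;
int leader_ranks[4] = { 0, 4, 8, 12 };   /* one leader per group of 4 */
MPI_Group world_group, leader_group;
MPI_Comm leader_comm = MPI_COMM_NULL;

MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_group(MPI_COMM_WORLD, &world_group);
MPI_Group_incl(world_group, 4, leader_ranks, &leader_group);

if (world_rank % 4 == 0) {
    /* collective only over the processes listed in leader_group */
    MPI_Comm_create_group(MPI_COMM_WORLD, leader_group, 0, &leader_comm);
}
MPI_Group_free(&leader_group);
MPI_Group_free(&world_group);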

  George.


On Thu, Jul 27, 2017 at 9:44 AM, Diego Avesani <diego.aves...@gmail.com>
wrote:

> Dear George, Dear all,
>
> I have tried to create a simple example. In particular, I would like to
> use 16 CPUs and to create four groups according to rank, and then a
> communicator between the masters of each group. I have tried to follow the first
> part of this example
> <http://mpitutorial.com/tutorials/introduction-to-groups-and-communicators/>.
> In the last part I have tried to create a communicator between masters, as
> suggested by George.
>
> Here my example:
>
>  CALL MPI_INIT(ierror)
>  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nCPU, ierror)
>  CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
>  !
>  colorloc = rank/4
>  !
>  CALL MPI_COMM_SPLIT(MPI_COMM_WORLD,colorloc,rank,*NEW_COMM*,ierror)
>  CALL MPI_COMM_RANK(*NEW_COMM*, subrank,ierror);
>  CALL MPI_COMM_SIZE(*NEW_COMM*, subnCPU,ierror);
>  !
>  IF(MOD(rank,4).EQ.0)THEN
> *! where I set color for the masters*
> colorglobal = MOD(rank,4)
>  ELSE
> colorglobal = MPI_UNDEFINED
>  ENDIF
>  !
>  CALL MPI_COMM_SPLIT(MPI_COMM_WORLD,colorglobal,rank,LEADER_COMM,ierror)
>  CALL MPI_COMM_RANK(*LEADER_COMM*, leader_rank,ierror);
>  CALL MPI_FINALIZE(ierror)
>
> I would like to know if this could be correct. I mean if I have understood
> correctly what George told me about the code design. Now, this example does
> not work, but probably there is some coding error.
>
> Really, Really thanks
> Diego
>
>
>
>
>
>
>
> Diego
>
>
> On 27 July 2017 at 10:42, Diego Avesani <diego.aves...@gmail.com> wrote:
>
>> Dear George, Dear all,
>>
>> A question regarding program design:
>> The draft that I have sent to you has to be executed many, many times.
>> Does the splitting procedure ensure efficiency?
>>
>> I will try, at least, to create groups and split them. I am a beginner in
>> the MPI groups environment.
>> really, really thanks.
>>
>> You are my lifesaver.
>>
>>
>>
>> Diego
>>
>>
>> On 26 July 2017 at 15:09, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>>> Diego,
>>>
>>> As all your processes are started under the umbrella of a single mpirun,
>>> they have a communicator in common, the MPI_COMM_WORLD.
>>>
>>> One possible implementation, using MPI_Comm_split, will be the following:
>>>
>>> MPI_Comm small_comm, leader_comm;
>>>
>>> /* Create small_comm on all processes */
>>>
>>> /* Now use MPI_Comm_split on MPI_COMM_WORLD to select the leaders */
>>> MPI_Comm_split( MPI_COMM_WORLD,
>>>  i_am_leader(small_comm) ? 1 :
>>> MPI_UNDEFINED,
>>>  rank_in_comm_world,
>>>  &leader_comm);
>>>
>>> The leader_comm will be a valid communicator on all leader processes,
>>> and MPI_COMM_NULL on all others.
>>>
>>>   George.
>>>
>>>
>>>
>>> On Wed, Jul 26, 2017 at 4:29 AM, Diego Avesani <diego.aves...@gmail.com>
>>> wrote:
>>>
>>>> Dear George, Dear all,
>>>>
>>>> I use "mpirun -np xx ./a.out"
>>>>
>>>> I do not know if I have some common grounds. I mean, I have to
>>>> design everything from the beginning. You can find what I would like to do in
>>>> the attachment. Basically, an MPI cast in another MPI. Consequently, I am
>>>> thinking of MPI groups or an MPI virtual topology with a 2D cart, using the
>>>> columns as "groups" and the first rows as the external groups to handle the
>>>> columns.
>>>>
>>>> What do think? What do you suggest?
>>>> Really Really thanks
>>>>
>>>>
>>>> Diego
>>>>
>>>>
>>>> On 25 July 2017 at 19:26, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>
>>>>> Diego,
>>>>>
>>>>> Assuming you have some common  grounds between the 4 initial groups
>>>>> (otherwise you w

Re: [OMPI users] MPI_IN_PLACE

2017-07-26 Thread George Bosilca
Volker,

Unfortunately, I can't replicate with icc. I tried on an x86_64 box with
the Intel compiler chain 17.0.4 20170411 to no avail. I also tested the
3.0.0-rc1 tarball and the current master, and your test completes without
errors in all cases.

Once you figure out an environment where you can consistently replicate the
issue, I would suggest attaching to the processes and:
- making sure the MPI_IN_PLACE as seen through the Fortran layer matches what
the C layer expects
- checking which collective algorithm is used by Open MPI

I have a "Fortran 101" level question. When you pass an array a(:) as
argument, what exactly gets passed via the Fortran interface to the
corresponding C function ?
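
For reference, what that C function expects in the in-place case is simply (a
minimal sketch, not your actual test):

double data[10];
/* ... fill data on every rank ... */
/* the send buffer is the MPI_IN_PLACE sentinel; the reduction result
 * overwrites data on every rank */
MPI_Allreduce(MPI_IN_PLACE, data, 10, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

so one thing to verify in the debugger is that the sentinel coming through the
Fortran wrapper is translated to this same MPI_IN_PLACE address before the
call reaches the C routine.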

  George.

On Wed, Jul 26, 2017 at 1:55 PM, Volker Blum <volker.b...@duke.edu> wrote:

> Thanks! Yes, trying with Intel 2017 would be very nice.
>
> > On Jul 26, 2017, at 6:12 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> >
> > No, I don't have (nor did I use, where it was available) the Intel compiler.
> I used clang and gfortran. I can try on a Linux box with the Intel 2017
> compilers.
> >
> >   George.
> >
> >
> >
> > On Wed, Jul 26, 2017 at 11:59 AM, Volker Blum <volker.b...@duke.edu>
> wrote:
> > Did you use Intel Fortran 2017 as well?
> >
> > (I’m asking because I did see the same issue with a combination of an
> earlier Intel Fortran 2017 version and OpenMPI on an Intel/Infiniband Linux
> HPC machine … but not Intel Fortran 2016 on the same machine. Perhaps I can
> revive my access to that combination somehow.)
> >
> > Best wishes
> > Volker
> >
> > > On Jul 26, 2017, at 5:55 PM, George Bosilca <bosi...@icl.utk.edu>
> wrote:
> > >
> > > I thought that maybe the underlying allreduce algorithm fails to
> support MPI_IN_PLACE correctly, but I can't replicate on any machine
> (including OSX) with any number of processes.
> > >
> > >   George.
> > >
> > >
> > >
> > > On Wed, Jul 26, 2017 at 10:59 AM, Volker Blum <volker.b...@duke.edu>
> wrote:
> > > Thanks!
> > >
> > > I tried ‘use mpi’, which compiles fine.
> > >
> > > Same result as with ‘include mpif.h', in that the output is
> > >
> > >  * MPI_IN_PLACE does not appear to work as intended.
> > >  * Checking whether MPI_ALLREDUCE works at all.
> > >  * Without MPI_IN_PLACE, MPI_ALLREDUCE appears to work.
> > >
> > > Hm. Any other thoughts?
> > >
> > > Thanks again!
> > > Best wishes
> > > Volker
> > >
> > > > On Jul 26, 2017, at 4:06 PM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
> > > >
> > > > Volker,
> > > >
> > > > With mpi_f08, you have to declare
> > > >
> > > > Type(MPI_Comm) :: mpi_comm_global
> > > >
> > > > (I am afk and not 100% sure of the syntax)
> > > >
> > > > A simpler option is to
> > > >
> > > > use mpi
> > > >
> > > > Cheers,
> > > >
> > > > Gilles
> > > >
> > > > Volker Blum <volker.b...@duke.edu> wrote:
> > > >> Hi Gilles,
> > > >>
> > > >> Thank you very much for the response!
> > > >>
> > > >> Unfortunately, I don’t have access to a different system with the
> issue right now. As I said, it’s not new; it just keeps creeping up
> unexpectedly again on different platforms. What puzzles me is that I’ve
> encountered the same problem with low but reasonable frequency over a
> period of now over five years.
> > > >>
> > > >> We can’t require F’08 in our application, unfortunately, since this
> standard is too new. Since we maintain a large application that has to run
> on a broad range of platforms, Fortran 2008 would not work for many of our
> users. In a few years, this will be different, but not yet.
> > > >>
> > > >> On gfortran: In our own tests, unfortunately, Intel Fortran
> consistently produced much faster executable code in the past. The latter
> observation may also change someday, but for us, the performance difference
> was an important constraint.
> > > >>
> > > >> I did suspect mpif.h, too. Not sure how to best test this
> hypothesis, however.
> > > >>
> > > >> Just replacing
> > > >>
> > > >>> include 'mpif.h'
> > > >>> with
> > > >>> use mpi_f08
> > > >>
> > > >> did not

Re: [OMPI users] MPI_IN_PLACE

2017-07-26 Thread George Bosilca
No, I don't have (nor did I use, where it was available) the Intel compiler. I
used clang and gfortran. I can try on a Linux box with the Intel 2017
compilers.

  George.



On Wed, Jul 26, 2017 at 11:59 AM, Volker Blum <volker.b...@duke.edu> wrote:

> Did you use Intel Fortran 2017 as well?
>
> (I’m asking because I did see the same issue with a combination of an
> earlier Intel Fortran 2017 version and OpenMPI on an Intel/Infiniband Linux
> HPC machine … but not Intel Fortran 2016 on the same machine. Perhaps I can
> revive my access to that combination somehow.)
>
> Best wishes
> Volker
>
> > On Jul 26, 2017, at 5:55 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> >
> > I thought that maybe the underlying allreduce algorithm fails to support
> MPI_IN_PLACE correctly, but I can't replicate on any machine (including
> OSX) with any number of processes.
> >
> >   George.
> >
> >
> >
> > On Wed, Jul 26, 2017 at 10:59 AM, Volker Blum <volker.b...@duke.edu>
> wrote:
> > Thanks!
> >
> > I tried ‘use mpi’, which compiles fine.
> >
> > Same result as with ‘include mpif.h', in that the output is
> >
> >  * MPI_IN_PLACE does not appear to work as intended.
> >  * Checking whether MPI_ALLREDUCE works at all.
> >  * Without MPI_IN_PLACE, MPI_ALLREDUCE appears to work.
> >
> > Hm. Any other thoughts?
> >
> > Thanks again!
> > Best wishes
> > Volker
> >
> > > On Jul 26, 2017, at 4:06 PM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
> > >
> > > Volker,
> > >
> > > With mpi_f08, you have to declare
> > >
> > > Type(MPI_Comm) :: mpi_comm_global
> > >
> > > (I am afk and not 100% sure of the syntax)
> > >
> > > A simpler option is to
> > >
> > > use mpi
> > >
> > > Cheers,
> > >
> > > Gilles
> > >
> > > Volker Blum <volker.b...@duke.edu> wrote:
> > >> Hi Gilles,
> > >>
> > >> Thank you very much for the response!
> > >>
> > >> Unfortunately, I don’t have access to a different system with the
> issue right now. As I said, it’s not new; it just keeps creeping up
> unexpectedly again on different platforms. What puzzles me is that I’ve
> encountered the same problem with low but reasonable frequency over a
> period of now over five years.
> > >>
> > >> We can’t require F’08 in our application, unfortunately, since this
> standard is too new. Since we maintain a large application that has to run
> on a broad range of platforms, Fortran 2008 would not work for many of our
> users. In a few years, this will be different, but not yet.
> > >>
> > >> On gfortran: In our own tests, unfortunately, Intel Fortran
> consistently produced much faster executable code in the past. The latter
> observation may also change someday, but for us, the performance difference
> was an important constraint.
> > >>
> > >> I did suspect mpif.h, too. Not sure how to best test this hypothesis,
> however.
> > >>
> > >> Just replacing
> > >>
> > >>> include 'mpif.h'
> > >>> with
> > >>> use mpi_f08
> > >>
> > >> did not work, for me.
> > >>
> > >> This produces a number of compilation errors:
> > >>
> > >> blum:/Users/blum/codes/fhi-aims/openmpi_test> mpif90
> check_mpi_in_place_08.f90 -o check_mpi_in_place_08.x
> > >> check_mpi_in_place_08.f90(55): error #6303: The assignment operation
> or the binary expression operation is invalid for the data types of the two
> operands.   [MPI_COMM_WORLD]
> > >>   mpi_comm_global = MPI_COMM_WORLD
> > >> --^
> > >> check_mpi_in_place_08.f90(57): error #6285: There is no matching
> specific subroutine for this generic subroutine call.   [MPI_COMM_SIZE]
> > >>   call MPI_COMM_SIZE(mpi_comm_global, n_tasks, mpierr)
> > >> -^
> > >> check_mpi_in_place_08.f90(58): error #6285: There is no matching
> specific subroutine for this generic subroutine call.   [MPI_COMM_RANK]
> > >>   call MPI_COMM_RANK(mpi_comm_global, myid, mpierr)
> > >> -^
> > >> check_mpi_in_place_08.f90(75): error #6285: There is no matching
> specific subroutine for this generic subroutine call.   [MPI_ALLREDUCE]
> > >>   call MPI_ALLREDUCE(MPI_IN_PLACE, &
> > >> -^
> > >> check_mpi_in_place_08.f90(94): err

Re: [OMPI users] Groups and Communicators

2017-07-26 Thread George Bosilca
Diego,

As all your processes are started under the umbrella of a single mpirun,
they have a communicator in common, the MPI_COMM_WORLD.

One possible implementation, using MPI_Comm_split, will be the following:

MPI_Comm small_comm, leader_comm;

/* Create small_comm on all processes */

/* Now use MPI_Comm_split on MPI_COMM_WORLD to select the leaders */
MPI_Comm_split( MPI_COMM_WORLD,
 i_am_leader(small_comm) ? 1 : MPI_UNDEFINED,
 rank_in_comm_world,
 &leader_comm);

The leader_comm will be a valid communicator on all leader processes, and
MPI_COMM_NULL on all others.

  George.



On Wed, Jul 26, 2017 at 4:29 AM, Diego Avesani <diego.aves...@gmail.com>
wrote:

> Dear George, Dear all,
>
> I use "mpirun -np xx ./a.out"
>
> I do not know if I have some common grounds. I mean, I have to
> design everything from the beginning. You can find what I would like to do in
> the attachment. Basically, an MPI cast in another MPI. Consequently, I am
> thinking of MPI groups or an MPI virtual topology with a 2D cart, using the
> columns as "groups" and the first rows as the external groups to handle the
> columns.
>
> What do think? What do you suggest?
> Really Really thanks
>
>
> Diego
>
>
> On 25 July 2017 at 19:26, George Bosilca <bosi...@icl.utk.edu> wrote:
>
>> Diego,
>>
>> Assuming you have some common  grounds between the 4 initial groups
>> (otherwise you will have to connect them via 
>> MPI_Comm_connect/MPI_Comm_accept)
>> you can merge the 4 groups together and then use any MPI mechanism to
>> create a partial group of leaders (such as MPI_Comm_split).
>>
>> If you spawn the groups via MPI_Comm_spawn then the answer is slightly
>> more complicated, you need to use MPI_Intercomm_create, with the spawner as
>> the bridge between the different communicators (and then
>> MPI_Intercomm_merge to create your intracomm). You can find a good answer
>> on stackoverflow on this at https://stackoverflow.com/ques
>> tions/24806782/mpi-merge-multiple-intercoms-into-a-single-intracomm
>>
>> How is your MPI environment started (single mpirun or mpi_comm_spawn) ?
>>
>>   George.
>>
>>
>>
>> On Tue, Jul 25, 2017 at 10:44 AM, Diego Avesani <diego.aves...@gmail.com>
>> wrote:
>>
>>> Dear All,
>>>
>>> I am studying Groups and Communicators, but before going into
>>> detail, I have a question about groups.
>>>
>>> I would like to know if it is possible to create a group of masters of
>>> the other groups and then an intra-communication in the new group. I have
>>> spent some time reading different tutorials and presentations, but it is
>>> difficult, at least for me, to understand if it is possible to create this
>>> sort of MPI cast in another MPI.
>>>
>>> In the attachment you can find a picture that summarizes what I would
>>> like to do.
>>>
>>> Another strategies could be use virtual topology.
>>>
>>> What do you think?
>>>
>>> I really, really appreciate any kind of help, suggestions or links where
>>> I can study these topics.
>>>
>>> Again, thanks
>>>
>>> Best Regards,
>>>
>>> Diego
>>>
