Re: [OMPI users] Segfault in ucp_dt_pack function from UCX library 1.8.0 and 1.11.2 for large sized communications using both OpenMPI 4.0.3 and 4.1.2

2022-06-02 Thread Josh Hursey via users
I would suggest trying OMPI v4.1.4 (or the v5 snapshot)
 * https://www.open-mpi.org/software/ompi/v4.1/ 
 
 * https://www.mail-archive.com/announce@lists.open-mpi.org//msg00152.html 
 

We fixed some large-payload collective issues in that release, which might be
what you are seeing here with MPI_Alltoallv and the tuned collective component.
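
If upgrading is not possible right away, it may also be worth experimenting with
the tuned component's algorithm selection, since your backtrace goes through the
pairwise alltoallv path. This is only a diagnostic suggestion, not a confirmed
workaround, and the exact parameter names should be verified against your build
with ompi_info:

    ompi_info --param coll tuned --level 9 | grep alltoallv
    mpirun --mca coll_tuned_use_dynamic_rules 1 \
           --mca coll_tuned_alltoallv_algorithm 1 ...

(a value of 1 requests the basic linear algorithm; 0 returns to the default
decision logic).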



On Thu, Jun 2, 2022 at 1:54 AM Mikhail Brinskii via users <users@lists.open-mpi.org> wrote:
Hi Eric,

 

Yes, UCX is supposed to be stable for large sized problems.

Did you see the same crash with both OMPI-4.0.3 + UCX 1.8.0 and OMPI-4.1.2 +
UCX 1.11.2?

Have you also tried to run the large sized problem tests with OMPI-5.0.x?

Regarding the application, at some point it invokes MPI_Alltoallv, sending more
than 2GB to some of the ranks (using a derived datatype), right?
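
For reference, the call shape being discussed looks roughly like the sketch
below. This is not the reporter's code, just an assumed minimal illustration:
each entry of the count arrays stays below 2^31, but because each element is a
derived datatype covering 1024 doubles, the byte volume per destination rank
exceeds 2 GB.

    /* Rough sketch of the call shape being discussed (not the reporter's code).
     * Counts fit in int, but each count covers a 1024-double block (8 KiB), so
     * the data sent to each destination rank exceeds 2 GB. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        MPI_Datatype block;                   /* 1024 doubles = 8 KiB per count */
        MPI_Type_contiguous(1024, MPI_DOUBLE, &block);
        MPI_Type_commit(&block);

        const int per_dest = 300000;          /* ~2.3 GiB per destination rank  */
        int *scounts = malloc(size * sizeof(int));
        int *sdispls = malloc(size * sizeof(int));
        int *rcounts = malloc(size * sizeof(int));
        int *rdispls = malloc(size * sizeof(int));
        for (int i = 0; i < size; ++i) {
            scounts[i] = rcounts[i] = per_dest;     /* counts stay below 2^31...   */
            sdispls[i] = rdispls[i] = i * per_dest; /* ...but displacements (also
                                                       int) overflow for many ranks */
        }

        /* Beware: each buffer is roughly 2.3 GiB times the number of ranks. */
        size_t n = (size_t)size * per_dest * 1024;
        double *sbuf = calloc(n, sizeof(double));
        double *rbuf = calloc(n, sizeof(double));

        MPI_Alltoallv(sbuf, scounts, sdispls, block,
                      rbuf, rcounts, rdispls, block, MPI_COMM_WORLD);

        MPI_Type_free(&block);
        free(sbuf); free(rbuf);
        free(scounts); free(sdispls); free(rcounts); free(rdispls);
        MPI_Finalize();
        return 0;
    }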

 

//WBR, Mikhail

 

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Eric Chamberland via users
Sent: Thursday, June 2, 2022 5:31 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Eric Chamberland <eric.chamberl...@giref.ulaval.ca>; Thomas Briffard <thomas.briff...@michelin.com>; Vivien Clauzon <vivien.clau...@michelin.com>; dave.mar...@giref.ulaval.ca; Ramses van Zon <r...@scinet.utoronto.ca>; charles.coulomb...@ulaval.ca
Subject: [OMPI users] Segfault in ucp_dt_pack function from UCX library 1.8.0 
and 1.11.2 for large sized communications using both OpenMPI 4.0.3 and 4.1.2

 

Hi,

In the past, we have successfully launched large sized (finite element)
computations using ParMETIS as the mesh partitioner.

We first succeeded in 2012 with OpenMPI (v2.?) and again in March 2019 with
OpenMPI 3.1.2.

Today, we have a bunch of nightly (small) tests running nicely and testing all
of OpenMPI (4.0.x, 4.1.x and 5.0.x), MPICH-3.3.2 and IntelMPI 2021.6.

Preparing to launch the same computation we did in 2012, and even larger ones,
we compiled with both OpenMPI 4.0.3+ucx-1.8.0 and OpenMPI 4.1.2+ucx-1.11.2
and launched computations from small to large problems (meshes).

For small meshes, it goes fine.

But when we reach nearly 2^31 faces in the 3D mesh we are using and call
ParMETIS_V3_PartMeshKway, we always get a segfault with the same backtrace
pointing into the UCX library:

Wed Jun  1 23:04:54 
2022:chrono::InterfaceParMetis::ParMETIS_V3_PartMeshKway::debut VmSize: 
1202304 VmRSS: 349456 VmPeak: 1211736 VmData: 500764 VmHWM: 359012  
Wed Jun  1 23:07:07 2022:Erreur    :  MEF++ Signal recu : 11 :  
segmentation violation  
Wed Jun  1 23:07:07 2022:Erreur    :   
Wed Jun  1 23:07:07 2022:-- (Début des 
informations destinées aux développeurs C++) --
Wed Jun  1 23:07:07 2022:La pile d'appels contient 27 symboles. 
Wed Jun  1 23:07:07 2022:# 000: 
reqBacktrace(std::__cxx11::basic_string, 
std::allocator >&)  >>>  probGD.opt 
(probGD.opt(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x71)
 [0x4119f1])
Wed Jun  1 23:07:07 2022:# 001: attacheDebugger()  >>>  probGD.opt 
(probGD.opt(_Z15attacheDebuggerv+0x29a) [0x41386a])
Wed Jun  1 23:07:07 2022:# 002: 
/gpfs/fs0/project/d/deteix/ericc/GIREF/lib/libgiref_opt_Util.so(traitementSignal+0x1f9f)
 [0x2ab3aef0e5cf]
Wed Jun  1 23:07:07 2022:# 003: /lib64/libc.so.6(+0x36400) 
[0x2ab3bd59a400]
Wed Jun  1 23:07:07 2022:# 004: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(ucp_dt_pack+0x123)
 [0x2ab3c966e353]
Wed Jun  1 23:07:07 2022:# 005: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(+0x536b7)
 [0x2ab3c968d6b7]
Wed Jun  1 23:07:07 2022:# 006: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/ucx/libuct_ib.so.0(uct_dc_mlx5_ep_am_bcopy+0xd7)
 [0x2ab3ca712137]
Wed Jun  1 23:07:07 2022:# 007: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(+0x52d3c)
 [0x2ab3c968cd3c]
Wed Jun  1 23:07:07 2022:# 008: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(ucp_tag_send_nbx+0x5ad)
 [0x2ab3c9696dcd]
Wed Jun  1 23:07:07 2022:# 009: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0xf2)
 [0x2ab3c922e0b2]
Wed Jun  1 23:07:07 2022:# 010: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0x92)
 [0x2ab3bbca5a32]
Wed Jun  1 23:07:07 2022:# 011: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/libmpi.so.40(ompi_coll_base_alltoallv_intra_pairwise+0x141)
 [0x2ab3bbcad941]
Wed Jun  1 23:07:07 2022:# 012: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_alltoallv_intra_dec_fixed+0x42)
 

Re: [OMPI users] PRRTE DVM: how to specify rankfile per prun invocation?

2021-01-11 Thread Josh Hursey via users
Thank you for the bug report. I filed a bug against PRRTE so this doesn't get
lost; you can follow it here:
  https://github.com/openpmix/prrte/issues/720

Making rankfile a per-job instead of a per-DVM option might require some
internal plumbing work, so I'm not sure how quickly this will be resolved, but
you can follow the status on that issue.


On Tue, Dec 15, 2020 at 8:40 PM Alexei Colin via users <users@lists.open-mpi.org> wrote:
Hi, is there a way to allocate more resources to rank 0 than to
any of the other ranks in the context of PRRTE DVM?

With mpirun (aka. prte) launcher, I can successfully accomplish this
using a rankfile:

    rank 0=+n0 slot=0
    rank 1=+n1 slot=0
    rank 2=+n1 slot=1

    mpirun -n 3 --bind-to none --map-by ppr:2:node:DISPLAY \
        --mca prte_rankfile arankfile ./mpitest

       JOB MAP   
    Data for JOB [13205,1] offset 0 Total slots allocated 256
        Mapping policy: PPR:NO_USE_LOCAL,NOOVERSUBSCRIBE  Ranking policy:
        SLOT Binding policy: NONE
        Cpu set: N/A  PPR: 2:node  Cpus-per-rank: N/A  Cpu Type: CORE


        Data for node: nid03828     Num slots: 64   Max slots: 0 Num procs: 1
            Process jobid: [13205,1] App: 0 Process rank: 0
            Bound: package[0][core:0]

        Data for node: nid03829     Num slots: 64   Max slots: 0    Num procs: 2
            Process jobid: [13205,1] App: 0 Process rank: 1 Bound: 
package[0][core:0]
            Process jobid: [13205,1] App: 0 Process rank: 2 Bound: 
package[0][core:1]

    =

But, I cannot achieve this with explicit prte; prun; pterm.  It looks
like rankfile is associated with the DVM as opposed to each prun
instance. I can do this: 

    prte --mca prte_rankfile arankfile
    prun ...
    pterm

But this is not useful for running multiple unrelated prun jobs in the same
DVM that each have a different rank count and ranks-per-node (ppr:N:node)
count, and therefore need their own mapping policy in their own rankfiles;
see the sketch below. (Multiple pruns in the same DVM are needed to pack
multiple subjobs into one resource manager job, in which one DVM spans the
full allocation.)
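
For concreteness, the intended packing pattern would look something like the
sketch below. The per-prun --rankfile is precisely the part that is rejected
today (the manpage mentions it, but prun refuses it), rankfile_A/rankfile_B are
placeholder names, and it assumes prun accepts the same --map-by syntax as
mpirun:

    prte &                                    # one DVM spanning the full allocation
    prun -n 3  --map-by ppr:2:node --rankfile rankfile_A ./mpitest
    prun -n 16 --map-by ppr:4:node --rankfile rankfile_B ./other_subjob
    pterm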

The user-specified rankfile is applied to the prte_rankfile global var
by the rmaps rank_file component, but that component is not loaded by prun
(only by prte, i.e. the DVM owner). Also, prte_ras_base_allocate processes
the prte_rankfile global, but it is not called by prun. Would a patch/hack
to somehow make these components load for prun be doable for me to hack
together? The 'mapping' is happening per prun instance, correct? So is it
just a matter of loading the rank file, or are there deeper architectural
obstacles?


Separate questions:

2. The prun man page mentions --rankfile and contains a section about
rankfiles, but the argument is not accepted:

    prun -n 3 --rankfile arankfile ./mpitest
    prun: Error: unknown option "--rankfile"

But the manpage for prte (aka. mpirun) does not mention rankfiles, nor the
only way I found to specify a rankfile: prte/mpirun
--mca prte_rankfile arankfile.

Do you want a PR moving rankfile section from the prun manpage to
the prte manpage and mentioning the MCA parameter as a means to
specify a rankfile?

P.S. Btw, --pmca and --gpmca in prun manpage are also not accepted.


3. How to provide a "default slot_list" in order to not require the
rankfile to enumerate every rank? (exactly the question asked here [1])

For example, omitting the line for rank 2 results in this error:

    rank 0=+n0 slot=0
    rank 1=+n1 slot=0

    mpirun -n 3 --bind-to none --map-by ppr:2:node:DISPLAY \
        --mca prte_rankfile arankfile ./mpitest

    A rank is missing its location specification:

    Rank:        2
    Rank file:   arankfile

    All processes must have their location specified in the rank file.
    Either add an entry to the file, or provide a default slot_list to
    use for any unspecified ranks.


4. Is there a way to use a rankfile but not bind to cores? Omitting
'slot' from lines in the rankfile is rejected:

    rank 0=+n0
    rank 1=+n1
    rank 2=+n1

Binding is orthogonal to mapping, correct? Would supporting rankfiles
without 'slot' be something I could quickly patch in?

Rationale: binding causes the following error with --map-by ppr:N:node:

    mpirun -n 3 --map-by ppr:2:node:DISPLAY \
        --mca prte_rankfile arankfile ./mpitest

    The request to bind processes could not be completed due to
    an internal error - the locale of the following process was
    not set by the mapper code:

      Process:  [[33295,1],0]

    Please contact the OMPI developers for assistance. Meantime,
    you will still be able to run your application without binding
    by specifying "--bind-to none" on your command line.

Adding '--bind-to none' eliminates the error, but the JOB MAP reports
that processes are bound, which is correct w.r.t. the rankfile but
contradictory to --bind-to none:

    

Re: [OMPI users] [ORTE] Connecting back to parent - Forcing tcp port

2021-01-07 Thread Josh Hursey via users
I posted a fix for the static ports issue (currently on the v4.1.x branch):
  https://github.com/open-mpi/ompi/pull/8339

If you have time, could you give it a try and confirm that it fixes your
issue?

Thanks,
Josh


On Tue, Dec 22, 2020 at 2:44 AM Vincent <boubl...@yahoo.co.uk> wrote:

On 18/12/2020 23:04, Josh Hursey wrote:

Vincent,
 
Thanks for the details on the bug. Indeed this is a case that seems to have 
been a problem for a little while now when you use static ports with ORTE (-mca 
oob_tcp_static_ipv4_ports option). It must have crept in when we refactored the 
internal regular expression mechanism for the v4 branches (and now that I look 
maybe as far back as v3.1). I just hit this same issue in the past day or so 
working with a different user.
 
Though I do not have a suggestion for a workaround at this time (sorry) I did 
file a GitHub Issue and am looking at this issue. With the holiday I don't know 
when I will have a fix, but you can watch the ticket for updates.
 
  https://github.com/open-mpi/ompi/issues/8304
 
In the meantime, you could try the v3.0 series release (which predates this 
change) or the current Open MPI master branch (which approaches this a little 
differently). The same command line should work in both. Both can be downloaded 
from the links below:
 
  https://www.open-mpi.org/software/ompi/v3.0/
 
  https://www.open-mpi.org/nightly/master/
 
 Hello Josh

 Thank you for considering the problem. I will certainly keep watching the
ticket. However, there is nothing really urgent (to me anyway).
 
Regarding your command line, it looks pretty good:
 
  orterun --launch-agent /home/boubliki/openmpi/bin/orted -mca btl tcp --mca 
btl_tcp_port_min_v4 6706 --mca btl_tcp_port_range_v4 10 --mca 
oob_tcp_static_ipv4_ports 6705 -host node2:1 -np 1 /path/to/some/program arg1 
.. argn
 

 
 
I would suggest, while you are debugging this, that you use a program like 
/bin/hostname instead of a real MPI program. If /bin/hostname launches properly 
then move on to an MPI program. That will assure you that the runtime wired up 
correctly (oob/tcp), and then we can focus on the MPI side of the communication 
(btl/tcp). You will want to change "-mca btl tcp" to at least "-mca btl 
tcp,self" (or better "-mca btl tcp,vader,self" if you want shared memory). 
'self' is the loopback interface in Open MPI.
 
 Yes. This is actually what I did. I just wanted to be generic and report the
problem without too much flourish.
 But it is important that you mention this for new users, helping them
understand the real purpose of each layer in an MPI implementation.
 
Is there a reason that you are specifying the --launch-agent to the orted? Is 
it installed in a different path on the remote nodes? If Open MPI is installed 
in the same location on all nodes then you shouldn't need that.
 
 I recompiled the sources, activating --enable-orterun-prefix-by-default when 
running ./configure. Of course, it helps :)
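 For the record, that rebuild amounts to something like the following (the
prefix path is the one used elsewhere in this thread; adjust to your install):

    ./configure --prefix=/home/boubliki/openmpi --enable-orterun-prefix-by-default
    make -j all install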
 
 Again, thank you.
 
 Kind regards
 
 Vincent.
 
Thanks,
 
Josh
 
 


-- 
Josh Hursey
IBM Spectrum MPI Developer


Re: [OMPI users] [ORTE] Connecting back to parent - Forcing tcp port

2020-12-18 Thread Josh Hursey via users
Vincent,

Thanks for the details on the bug. Indeed this is a case that seems to have 
been a problem for a little while now when you use static ports with ORTE (-mca 
oob_tcp_static_ipv4_ports option). It must have crept in when we refactored the 
internal regular expression mechanism for the v4 branches (and now that I look 
maybe as far back as v3.1). I just hit this same issue in the past day or so 
working with a different user.

Though I do not have a suggestion for a workaround at this time (sorry) I did 
file a GitHub Issue and am looking at this issue. With the holiday I don't know 
when I will have a fix, but you can watch the ticket for updates.
  https://github.com/open-mpi/ompi/issues/8304

In the meantime, you could try the v3.0 series release (which predates this 
change) or the current Open MPI master branch (which approaches this a little 
differently). The same command line should work in both. Both can be downloaded 
from the links below:
  https://www.open-mpi.org/software/ompi/v3.0/
  https://www.open-mpi.org/nightly/master/


Regarding your command line, it looks pretty good:
  orterun --launch-agent /home/boubliki/openmpi/bin/orted -mca btl tcp --mca 
btl_tcp_port_min_v4 6706 --mca btl_tcp_port_range_v4 10 --mca 
oob_tcp_static_ipv4_ports 6705 -host node2:1 -np 1 /path/to/some/program arg1 
.. argn

I would suggest, while you are debugging this, that you use a program like 
/bin/hostname instead of a real MPI program. If /bin/hostname launches properly 
then move on to an MPI program. That will assure you that the runtime wired up 
correctly (oob/tcp), and then we can focus on the MPI side of the communication 
(btl/tcp). You will want to change "-mca btl tcp" to at least "-mca btl 
tcp,self" (or better "-mca btl tcp,vader,self" if you want shared memory). 
'self' is the loopback interface in Open MPI.
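
Putting those two suggestions together, a first sanity check could look
something like the line below (same ports and host as your command, just the
extended btl list and a non-MPI payload):

    orterun --launch-agent /home/boubliki/openmpi/bin/orted \
        -mca btl tcp,vader,self \
        --mca btl_tcp_port_min_v4 6706 --mca btl_tcp_port_range_v4 10 \
        --mca oob_tcp_static_ipv4_ports 6705 \
        -host node2:1 -np 1 /bin/hostname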

Is there a reason that you are specifying the --launch-agent to the orted? Is 
it installed in a different path on the remote nodes? If Open MPI is installed 
in the same location on all nodes then you shouldn't need that.


Thanks,
Josh



On Wed, Dec 16, 2020 at 9:23 AM Vincent Letocart via users <users@lists.open-mpi.org> wrote:
 
 
 Good morning
 
 I am facing a tuning problem while playing with the orterun command in order 
to set a tcp port within a specific range.
 Part of this may be that I'm not very familiar with the architecture of the
software, and I sometimes struggle with the documentation.

 Here is what I'm trying to do (the problem has been reduced here to launching
a single task on one remote node):
 orterun --launch-agent /home/boubliki/openmpi/bin/orted -mca btl tcp --mca 
btl_tcp_port_min_v4 6706 --mca btl_tcp_port_range_v4 10 --mca 
oob_tcp_static_ipv4_ports 6705 -host node2:1 -np 1 /path/to/some/program arg1 
.. argn
 Those mca options are highlighted here and there in various mailing-lists or 
archives on the net. Version is 4.0.5.
 
 I tried different combinations, like:
 only --mca btl_tcp_port_min_v4 6706 --mca btl_tcp_port_range_v4 10 (then
--report-uri shows a randomly picked tcp port number),
 or adding --mca oob_tcp_static_ipv4_ports 6705 (then --report-uri reports the
tcp port I specified and everything crashes),
 or many others,
 but the result becomes:
 [node2:4050181] *** Process received signal ***
 [node2:4050181] Signal: Segmentation fault (11)
 [node2:4050181] Signal code: Address not mapped (1)
 [node2:4050181] Failing at address: (nil)
 [node2:4050181] [ 0] /lib64/libpthread.so.0(+0x12dd0)[0x7fdaf95a9dd0]
 [node2:4050181] *** End of error message ***
 bash: line 1: 4050181 Segmentation fault  (core dumped) 
/home/boubliki/openmpi/bin/orted -mca ess "env" -mca ess_base_jobid 
"1254293504" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca 
orte_node_regex "node[1:1,2]@0(2)" -mca btl "tcp" --mca btl_tcp_port_min_v4 
"6706" --mca btl_tcp_port_range_v4 "10" --mca oob_tcp_static_ipv4_ports "6705" 
-mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri 
"1254293504.0;tcp://192.168.xxx.xxx:6705" -mca orte_launch_agent 
"/home/boubliki/openmpi/bin/orted" -mca pmix "^s1,s2,cray,isolated"
 I tried on different machines, and also with different compilers (gcc 10.2 and
intel 19u1). Version 4.1.0rc5 did not improve the execution. Forcing no
optimization with -O0 did not help either.
 
 I am not familiar with debugging such software, but I could add a latency
somewhere (sleep()) and catch the orted process on the [single] remote node,
reaching line 572 with gdb:
 
 boubliki@node1: ~/openmpi/src/openmpi-4.0.5> cat -n orte/mca/ess/base/ess_base_std_orted.c | sed -n -r -e '562,583p'
    562      if (orte_static_ports || orte_fwd_mpirun_port) {
    563          if (NULL == orte_node_regex) {
    564              /* we didn't get the node info */
    565              error = "cannot construct daemon map for static ports - no node map info";
    566              goto error;
    567          }
    568          /* extract the node info from