Re: [OMPI users] libnuma.so error

2023-07-20 Thread Gus Correa via users
Hi Luis

That's odd, because if the numa/libnuma packages were properly
installed, the softlink should have been created.
Maybe check with "yum list | grep numa", then, if something is missing, use
"yum install ...".
[Anyway, maybe the compute nodes use a different mechanism to pull their
system image, separate from yum/dnf/apt.]
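
For example, on a RHEL/CentOS-style node the check and fix might look
roughly like this (the package names below are the usual ones, but they can
differ between distributions, so treat this as a sketch, not a recipe):

  yum list installed | grep -i numa       # is numactl / numactl-libs there?
  yum install numactl-libs numactl-devel  # the -devel package normally provides
                                          # the unversioned libnuma.so softlink
  ls -l /usr/lib64/libnuma.so*            # the softlink should point to libnuma.so.1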

Gus

On Thu, Jul 20, 2023 at 4:00 AM Luis Cebamanos via users <
users@lists.open-mpi.org> wrote:

> Hi Gus,
>
> Yeap, I can see softlink is missing on the compute nodes.
>
> Thanks!
> Luis
>
> On 19/07/2023 17:42, Gus Correa via users wrote:
>
> If it is installed, libnuma should be in:
> /usr/lib64/libnuma.so
> as a softlink to the actual number-versioned library.
> In general the loader is configured to search for shared libraries
> in /usr/lib64 ("ldd <executable>" may shed some light here).
>
> You can check if the numa packages are installed with:
> yum list | grep numa (CentOS 7, RHEL 7)
> dnf list | grep numa (CentOS 8, RHEL 8, RockyLinux 8, Fedora, etc)
> apt list | grep numa (Debian, Ubuntu)
>
> If not, you can install them (or ask the system administrator to do it).
>
> I hope this helps,
> Gus Correa
>
>
> On Wed, Jul 19, 2023 at 11:55 AM Jeff Squyres (jsquyres) via users <
> users@lists.open-mpi.org> wrote:
>
>> It's not clear if that message is being emitted by Open MPI.
>>
>> It does say it's falling back to a different behavior if libnuma.so is
>> not found, so it appears that it's treating it as a warning, not an error.
>> --
>> *From:* users  on behalf of Luis
>> Cebamanos via users 
>> *Sent:* Wednesday, July 19, 2023 10:09 AM
>> *To:* users@lists.open-mpi.org 
>> *Cc:* Luis Cebamanos 
>> *Subject:* [OMPI users] libnuma.so error
>>
>> Hello,
>>
>> I was wondering if anyone has ever seen the following runtime error:
>>
>> mpirun -np 32 ./hello
>> .
>> [LOG_CAT_SBGP] libnuma.so: cannot open shared object file: No such file
>> or directory
>> [LOG_CAT_SBGP] Failed to dlopen libnuma.so. Fallback to GROUP_BY_SOCKET
>> manual.
>> .
>>
>> The funny thing is that the binary is executed despite the errors.
>> What could be causing it?
>>
>> Regards,
>> Luis
>>
>
>


Re: [OMPI users] libnuma.so error

2023-07-19 Thread Gus Correa via users
If it is installed, libnuma should be in:
/usr/lib64/libnuma.so
as a softlink to the actual number-versioned library.
In general the loader is configured to search for shared libraries
in /usr/lib64 ("ldd <executable>" may shed some light here).
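
For instance, with the binary from your command line (just a sketch; adjust
the path to wherever your executable lives):

  ldd ./hello | grep -i numa

If the library is resolvable you should see something like
"libnuma.so.1 => /usr/lib64/libnuma.so.1"; a "not found" there means the
loader cannot locate it.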

You can check if the numa packages are installed with:
yum list | grep numa (CentOS 7, RHEL 7)
dnf list | grep numa (CentOS 8, RHEL 8, RockyLinux 8, Fedora, etc)
apt list | grep numa (Debian, Ubuntu)

If not, you can install them (or ask the system administrator to do it).

I hope this helps,
Gus Correa


On Wed, Jul 19, 2023 at 11:55 AM Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> It's not clear if that message is being emitted by Open MPI.
>
> It does say it's falling back to a different behavior if libnuma.so is not
> found, so it appears that it's treating it as a warning, not an error.
> --
> *From:* users  on behalf of Luis
> Cebamanos via users 
> *Sent:* Wednesday, July 19, 2023 10:09 AM
> *To:* users@lists.open-mpi.org 
> *Cc:* Luis Cebamanos 
> *Subject:* [OMPI users] libnuma.so error
>
> Hello,
>
> I was wondering if anyone has ever seen the following runtime error:
>
> mpirun -np 32 ./hello
> .
> [LOG_CAT_SBGP] libnuma.so: cannot open shared object file: No such file
> or directory
> [LOG_CAT_SBGP] Failed to dlopen libnuma.so. Fallback to GROUP_BY_SOCKET
> manual.
> .
>
> The funny thing is that the binary is executed despite the errors.
> What could be causing it?
>
> Regards,
> Luis
>


Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-07 Thread Gus Correa via users
This may have changed since, but these used to be relevant points.
Overall, the Open MPI FAQ has lots of good suggestions:
https://www.open-mpi.org/faq/
and some specific to performance tuning:
https://www.open-mpi.org/faq/?category=tuning
https://www.open-mpi.org/faq/?category=openfabrics

1) Make sure you are not using Ethernet TCP/IP, which is widely
available on compute nodes:

mpirun --mca btl self,sm,openib ...

https://www.open-mpi.org/faq/?category=tuning#selecting-components

However, this may have changed lately:
https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable

2) The maximum locked memory used by IB and its system limit. Start here:
https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage

3) The eager vs. rendezvous message size threshold.
I wonder if it may sit right where you see the latency spike.
https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user

4) Processor and memory locality/affinity and binding (please check
the current options and syntax)
https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4
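
As a rough sketch of how to poke at items 2) and 4) (option and parameter
names may have changed between Open MPI versions, so double-check them with
ompi_info on your system; also note that on 3.x the shared-memory BTL is
"vader" rather than "sm"):

  ulimit -l                        # locked-memory limit on the compute nodes;
                                   # "unlimited" is what you want for IB
  ompi_info --all | grep -i eager  # eager-limit parameters of the BTLs in use
  mpirun --mca btl self,vader,openib --map-by node --bind-to core \
         -np 200 ./osu_allreduce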


On Mon, Feb 7, 2022 at 11:01 AM Benson Muite via users <
users@lists.open-mpi.org> wrote:

> Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php
>
> mpirun --verbose --display-map
>
> Have you tried newer OpenMPI versions?
>
> Do you get similar behavior for the osu_reduce and osu_gather benchmarks?
>
> Typically internal buffer sizes as well as your hardware will affect
> performance. Can you give specifications similar to what is available at:
> http://mvapich.cse.ohio-state.edu/performance/collectives/
> where the operating system, switch, node type and memory are indicated.
>
> If you need good performance, you may also want to specify the algorithm
> used. You can find some of the parameters you can tune using:
>
> ompi_info --all
>
> A particular helpful parameter is:
>
> MCA coll tuned: parameter "coll_tuned_allreduce_algorithm"
>   (current value: "ignore", data source: default, level: 5 tuner/detail, type: int)
>   Which allreduce algorithm is used. Can be locked down to any of:
>   0 ignore, 1 basic linear, 2 nonoverlapping (tuned reduce + tuned bcast),
>   3 recursive doubling, 4 ring, 5 segmented ring
>   Valid values: 0:"ignore", 1:"basic_linear", 2:"nonoverlapping",
>   3:"recursive_doubling", 4:"ring", 5:"segmented_ring", 6:"rabenseifner"
> MCA coll tuned: parameter "coll_tuned_allreduce_algorithm_segmentsize"
>   (current value: "0", data source: default, level: 5 tuner/detail, type: int)
>
> For OpenMPI 4.0, there is a tuning program [2] that might also be helpful.
>
> [1]
>
> https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi
> [2] https://github.com/open-mpi/ompi-collectives-tuning
>
> On 2/7/22 4:49 PM, Bertini, Denis Dr. wrote:
> > Hi
> >
> > When I repeat it, I always get the huge discrepancy at the
> > message size of 16384.
> >
> > Maybe there is a way to run MPI in verbose mode in order
> > to further investigate this behaviour?
> >
> > Best
> >
> > Denis
> >
> > 
> > *From:* users  on behalf of Benson
> > Muite via users 
> > *Sent:* Monday, February 7, 2022 2:27:34 PM
> > *To:* users@lists.open-mpi.org
> > *Cc:* Benson Muite
> > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband
> > network
> > Hi,
> > Do you get similar results when you repeat the test? Another job could
> > have interfered with your run.
> > Benson
> > On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:
> >> Hi
> >>
> >> I am using the OSU microbenchmarks compiled with Open MPI 3.1.6 in order
> >> to check/benchmark the InfiniBand network for our cluster.
> >>
> >> For that I use the collective all_reduce benchmark and run over 200
> >> nodes, using 1 process per node.
> >>
> >> And these are the results I obtained:
> >>
> >>
> >>
> >> 
> >>
> >> # OSU MPI Allreduce Latency Test v5.7.1
> >> # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)   Iterations
> >> 4                   114.65             83.22            147.98         1000
> >> 8                   133.85            106.47            164.93         1000
> >> 16                  116.41             87.57            150.58         1000
> >> 32                  112.17             93.25            130.23         1000
> >> 64                  106.85             81.93            134.74         1000
> >> 128                 117.53             87.50            152.27         1000
> >> 256                 143.08            115.63            173.97         1000
> >> 512                 130.34            100.20            167.56         1000
> >> 1024                155.67            111.29            188.20         1000
> >> 2048                151.82            116.03            198.19         1000
> >> 4096                159.11

Re: [OMPI users] stdout scrambled in file

2021-12-05 Thread Gus Correa via users
Hi Mark

Back in the day I liked the mpirun/mpiexec --tag-output option.
Jeff: Does it still exist?
It may not prevent 100% of the splitting of output lines,
but tagging the lines with the process rank helps.
You can grep the stdout log for the rank that you want,
which helps a lot when several processes are talking.
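
A minimal sketch of what I mean (the exact tag format may differ between
Open MPI versions, so check what it looks like on yours first):

  mpirun --tag-output -np 4 ./a.out > run.log 2>&1
  grep '^\[1,0\]' run.log    # keep only the lines printed by rank 0

Each line comes out prefixed with something like [jobid,rank]<stdout>:,
which is what makes the grep possible.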

I hope this helps,
Gus Correa


On Sun, Dec 5, 2021 at 1:12 PM Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> FWIW: Open MPI 4.1.2 has been released -- you can probably stop using an
> RC release.
>
> I think you're probably running into an issue that is just a fact of
> life.  Especially when there's a lot of output simultaneously from multiple
> MPI processes (potentially on different nodes), the stdout/stderr lines can
> just get munged together.
>
> Can you check for convergence a different way?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> 
> From: users  on behalf of Fisher (US),
> Mark S via users 
> Sent: Thursday, December 2, 2021 10:48 AM
> To: users@lists.open-mpi.org
> Cc: Fisher (US), Mark S
> Subject: [OMPI users] stdout scrambled in file
>
> We are using Mellanox HPC-X MPI, based on Open MPI 4.1.1RC1, and are having
> issues with lines occasionally scrambling together. This causes issues for our
> convergence checking code, since we put convergence data there. We are not
> using any mpirun options for stdout; we just redirect stdout/stderr to a
> file before we run the mpirun command, so all output goes there. We had a
> similar issue with Intel MPI in the past and used the -ordered-output option to
> fix it, but I do not see any similar option for Open MPI. See the example below.
> Is there any way to ensure that a line from a process gets one line in the output
> file?
>
>
> The data in red below is scrambled up and should look like the cleaned-up
> version. You can see it put a line from a different process inside a line
> from another process, and the rest of the line ended up a couple of lines
> down.
>
> ZONE   0 : Min/Max CFL= 5.000E-01 1.500E+01 Min/Max DT= 8.411E-10
> 1.004E-01 sec
>
> *IGSTAB* 1626 7.392E-02 2.470E-01 -9.075E-04 8.607E-03 -5.911E-04
> -4.945E-06  aerosurfs
> *IGMNTAERO* 1626 -6.120E-04 1.406E-02 6.395E-04 4.473E-08 3.112E-04
> -2.785E-05  aerosurfs
> *IGSTAB* 1626 7.392E-02 2.470E-01 -9.075E-04 8.607E-03 -5.911E-04
> -4.945E-06  Aircraft-Total
> *IGMNTAERO* 1626 -6.120E-04 1.406E-02 6.395E-04 4.473E-08 3.112E-04
> -2.785E-05  Aircr Warning: BCFD: US_UPDATEQ: izon, iter, nBadpmin:  699
> 1625 12
> Warning: BCFD: US_UPDATEQ: izon, iter, nBadpmin:  111  1626  6
> aft-Total
> *IGSTAB* 1626 6.623E-02 2.137E-01 -9.063E-04 8.450E-03 -5.485E-04
> -4.961E-06  Aircraft-OML
> *IGMNTAERO* 1626 -6.118E-04 -1.602E-02 6.404E-04 5.756E-08 3.341E-04
> -2.791E-05  Aircraft-OML
>
>
> Cleaned up version:
>
> ZONE   0 : Min/Max CFL= 5.000E-01 1.500E+01 Min/Max DT= 8.411E-10
> 1.004E-01 sec
>
> *IGSTAB* 1626 7.392E-02 2.470E-01 -9.075E-04 8.607E-03 -5.911E-04
> -4.945E-06  aerosurfs
> *IGMNTAERO* 1626 -6.120E-04 1.406E-02 6.395E-04 4.473E-08 3.112E-04
> -2.785E-05  aerosurfs
> *IGSTAB* 1626 7.392E-02 2.470E-01 -9.075E-04 8.607E-03 -5.911E-04
> -4.945E-06  Aircraft-Total
> *IGMNTAERO* 1626 -6.120E-04 1.406E-02 6.395E-04 4.473E-08 3.112E-04
> -2.785E-05  Aircraft-Total
>  Warning: BCFD: US_UPDATEQ: izon, iter, nBadpmin:  699  1625 12
> Warning: BCFD: US_UPDATEQ: izon, iter, nBadpmin:  111  1626  6
> *IGSTAB* 1626 6.623E-02 2.137E-01 -9.063E-04 8.450E-03 -5.485E-04
> -4.961E-06  Aircraft-OML
> *IGMNTAERO* 1626 -6.118E-04 -1.602E-02 6.404E-04 5.756E-08 3.341E-04
> -2.791E-05  Aircraft-OML
>
> Thanks!
>


Re: [OMPI users] Error with building OMPI with PGI

2021-01-14 Thread Gus Correa via users
Hi Passant, list

This is an old problem with PGI.
There are many threads in the OpenMPI mailing list archives about this,
with workarounds.
The simplest is to use FC="pgf90 -noswitcherror".
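
For your configure line that would mean something along these lines (just a
sketch, keeping the rest of your options as they were):

  ./configure CPP=cpp CC=pgcc CXX=pgc++ F77=pgf77 FC="pgf90 -noswitcherror" \
      --prefix=$PREFIX --with-ucx=$UCX_HOME --with-slurm \
      --with-pmi=/opt/slurm/cluster/ibex/install --with-cuda=$CUDATOOLKIT_HOME

Since your log shows pgcc (not pgf90) choking on -pthread at link time, you
may need the same trick on the C side as well, i.e. CC="pgcc -noswitcherror".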

Here are two out of many threads ... well,  not pthreads!  :)
https://www.mail-archive.com/users@lists.open-mpi.org/msg08962.html
https://www.mail-archive.com/users@lists.open-mpi.org/msg10375.html

I hope this helps,
Gus Correa

On Thu, Jan 14, 2021 at 5:45 PM Passant A. Hafez via users <
users@lists.open-mpi.org> wrote:

> Hello,
>
>
> I'm having an error when trying to build OMPI 4.0.3 (also tried 4.1) with
> PGI 20.1
>
>
> ./configure CPP=cpp CC=pgcc CXX=pgc++ F77=pgf77 FC=pgf90
> --prefix=$PREFIX --with-ucx=$UCX_HOME --with-slurm
> --with-pmi=/opt/slurm/cluster/ibex/install --with-cuda=$CUDATOOLKIT_HOME
>
>
> in the make install step:
>
> make[4]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/pmix3x'
> make[3]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/pmix3x'
> make[2]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/pmix3x'
> Making install in mca/pmix/s1
> make[2]: Entering directory `/tmp/openmpi-4.0.3/opal/mca/pmix/s1'
>   CCLD mca_pmix_s1.la
> pgcc-Error-Unknown switch: -pthread
> make[2]: *** [mca_pmix_s1.la] Error 1
> make[2]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/s1'
> make[1]: *** [install-recursive] Error 1
> make[1]: Leaving directory `/tmp/openmpi-4.0.3/opal'
> make: *** [install-recursive] Error 1
>
> Please advise.
>
>
>
>
> All the best,
> Passant
>


Re: [OMPI users] 4.0.5 on Linux Pop!_OS

2020-11-07 Thread Gus Correa via users
>> Core(s) per socket:  8

> "4. If none of a hostfile, the --host command line parameter, or an RM is
> present, Open MPI defaults to the number of processor cores"

Have you tried -np 8?
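
In other words, spelling out the options the error message already lists
(using the variables from your own script):

  $mympirun -np 8 $exe input1                       # one rank per physical core
  $mympirun --use-hwthread-cpus -np 16 $exe input1  # count the 16 hardware threads as slots
  $mympirun --oversubscribe -np 12 $exe input1      # ignore the slot count entirely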


On Sun, Nov 8, 2020 at 12:25 AM Paul Cizmas via users <
users@lists.open-mpi.org> wrote:

> Gilles:
>
> Thank you for your reply.  Unfortunately, it did not quite help me.
>
> As I said in my e-mail, I can run this on a Mac by only specifying
>
> $mympirun -np 12  $exe input1
>
> without worrying about “slots”.
>
> So, my questions are:
>
> 1. Why do I need “slot” on the Linux?
>
> 2. Is there a relation between slots, sockets, cores and threads?  The
> workstation has 1 socket, 8 cores per socket and 2 threads per core, or 16
> CPUs.  How many slots are there?
>
> 3. If I need to specify “slot”, what is the syntax?
>
> I tried:
>
> $mympirun -np 12 slots=12 $exe input1
>
> and got:
> ==
> No protocol specified
> --
> There are not enough slots available in the system to satisfy the 12
> slots that were requested by the application:
>
>  slots=12
>
> Either request fewer slots for your application, or make more slots
> available for use.
> ==
> Finally, I made it work by using
>
> $mympirun -np 12 --use-hwthread-cpus $exe input1
>
> and ignored all the slot options, so I missed the chance to learn about
> slots.
>
> I did not find an example on how to specify the “slot” although the
> message lists four options - four options but zero examples.
>
> Thank you,
> Paul
>
> > On Nov 7, 2020, at 8:23 PM, Gilles Gouaillardet via users <
> users@lists.open-mpi.org> wrote:
> >
> > Paul,
> >
> > a "slot" is explicitly defined in the error message you copy/pasted:
> >
> > "If none of a hostfile, the --host command line parameter, or an RM is
> > present, Open MPI defaults to the number of processor cores"
> >
> > The error message also lists 4 ways on how you can move forward, but
> > you should first ask yourself if you really want to run 12 MPI tasks
> > on your machine.
> >
> > Cheers,
> >
> > Gilles
> >
> > On Sun, Nov 8, 2020 at 11:14 AM Paul Cizmas via users
> >  wrote:
> >>
> >> Hello:
> >>
> >> I just installed OpenMPI 4.0.5 on a Linux machine running Pop!_OS (made
> by System76).  The workstation has the following architecture:
> >>
> >> Architecture:x86_64
> >> CPU op-mode(s):  32-bit, 64-bit
> >> Byte Order:  Little Endian
> >> Address sizes:   39 bits physical, 48 bits virtual
> >> CPU(s):  16
> >> On-line CPU(s) list: 0-15
> >> Thread(s) per core:  2
> >> Core(s) per socket:  8
> >> Socket(s):   1
> >> NUMA node(s):1
> >> Vendor ID:   GenuineIntel
> >> CPU family:  6
> >>
> >> I am trying to run on the Linux box a code that I usually run on a Mac
> OS without any issues.
> >>
> >> The script that I use is:
> >>
> >> exe='/usr/bin/mycode' # on jp2
> >> mympirun='/opt/openmpi/4.0.5/bin/mpirun'   # GFortran on jp2
> >> $mympirun -np 12  $exe input1
> >>
> >> I get the following error:
> >> 
> >> No protocol specified
> >>
> --
> >> There are not enough slots available in the system to satisfy the 12
> >> slots that were requested by the application:
> >>
> >> /usr/bin/mycode
> >>
> >> Either request fewer slots for your application, or make more slots
> >> available for use.
> >>
> >> A "slot" is the Open MPI term for an allocatable unit where we can
> >> launch a process.  The number of slots available are defined by the
> >> environment in which Open MPI processes are run:
> >>
> >> 1. Hostfile, via "slots=N" clauses (N defaults to number of
> >>processor cores if not provided)
> >> 2. The --host command line parameter, via a ":N" suffix on the
> >>hostname (N defaults to 1 if not provided)
> >> 3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
> >> 4. If none of a hostfile, the --host command line parameter, or an
> >>RM is present, Open MPI defaults to the number of processor cores
> >>
> >> In all the above cases, if you want Open MPI to default to the number
> >> of hardware threads instead of the number of processor cores, use the
> >> --use-hwthread-cpus option.
> >>
> >> Alternatively, you can use the --oversubscribe option to ignore the
> >> number of available slots when deciding the number of processes to
> >> launch.
> >> ===
> >>
> >> I do not understand “slots”.  The architecture description of my Linux
> box lists sockets, cores and threads, but not slots.
> >>
> >> What shall I specify instead of "-np 12”?
> 

Re: [OMPI users] mpirun on Kubuntu 20.4.1 hangs

2020-10-21 Thread Gus Correa via users
Hi Jorge

You may have an active firewall protecting either computer or both,
and preventing mpirun from starting the connection.
Your /etc/hosts file may also not have the computer IP addresses.
You may also want to try the --hostfile option.
Likewise, the --verbose option may also help diagnose the problem.

It would help if you send the mpirun command line, the hostfile (if any),
error message if any, etc.
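
For example (the hostnames below are placeholders), something as simple as

  mpirun -np 2 --hostfile myhosts hostname

with a "myhosts" file containing

  machine1 slots=1
  machine2 slots=1

is usually enough to tell whether the two computers can launch on each other
at all, and adding "--mca plm_base_verbose 10" prints the launch details.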


These FAQs may help diagnose and solve the problem:

https://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
https://www.open-mpi.org/faq/?category=running#mpirun-hostfile
https://www.open-mpi.org/faq/?category=running

I hope this helps,
Gus Correa

On Tue, Oct 20, 2020 at 4:47 PM Jorge SILVA via users <
users@lists.open-mpi.org> wrote:

> Hello,
>
> I installed Kubuntu 20.4.1 with openmpi 4.0.3-0ubuntu on two different
> computers in the standard way. Compiling with mpif90 works, but mpirun
> hangs with no output on both systems. Even the mpirun command without
> parameters hangs, and only typing Ctrl-C twice can end the sleeping
> program. Only the command
>
>  mpirun --help
>
> gives the usual output.
>
> It seems to be something related to the terminal output, but the command
> worked well on Kubuntu 18.04. Is there a way to debug or fix this
> problem (without re-compiling from sources, etc.)? Is it a known problem?
>
> Thanks,
>
>   Jorge
>
>


Re: [OMPI users] Code failing when requesting all "processors"

2020-10-13 Thread Gus Correa via users
Can you use taskid after MPI_Finalize?
Isn't it undefined/deallocated at that point?
Just a question (... or two) ...

Gus Correa

>  MPI_Finalize();
>
>  printf("END OF CODE from task %d\n", taskid);





On Tue, Oct 13, 2020 at 10:34 AM Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> That's odd.  What version of Open MPI are you using?
>
>
> > On Oct 13, 2020, at 6:34 AM, Diego Zuccato via users <
> users@lists.open-mpi.org> wrote:
> >
> > Hello all.
> >
> > I have a problem on a server: launching a job with mpirun fails if I
> > request all 32 CPUs (threads, since HT is enabled) but succeeds if I
> > only request 30.
> >
> > The test code is really minimal:
> > -8<--
> > #include "mpi.h"
> > #include <stdio.h>
> > #include <stdlib.h>
> > #define  MASTER 0
> >
> > int main (int argc, char *argv[])
> > {
> >  int   numtasks, taskid, len;
> >  char hostname[MPI_MAX_PROCESSOR_NAME];
> >  MPI_Init(&argc, &argv);
> > //  int provided=0;
> > //  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
> > //printf("MPI provided threads: %d\n", provided);
> >  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
> >  MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
> >
> >  if (taskid == MASTER)
> >printf("This is an MPI parallel code for Hello World with no
> > communication\n");
> >  //MPI_Barrier(MPI_COMM_WORLD);
> >
> >
> >  MPI_Get_processor_name(hostname, &len);
> >
> >  printf ("Hello from task %d on %s!\n", taskid, hostname);
> >
> >  if (taskid == MASTER)
> >printf("MASTER: Number of MPI tasks is: %d\n",numtasks);
> >
> >  MPI_Finalize();
> >
> >  printf("END OF CODE from task %d\n", taskid);
> > }
> > -8<--
> > (the commented section is a leftover of one of the tests).
> >
> > The error is :
> > -8<--
> > [str957-bl0-03:19637] *** Process received signal ***
> > [str957-bl0-03:19637] Signal: Segmentation fault (11)
> > [str957-bl0-03:19637] Signal code: Address not mapped (1)
> > [str957-bl0-03:19637] Failing at address: 0x77fac008
> > [str957-bl0-03:19637] [ 0]
> > /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x77e92730]
> > [str957-bl0-03:19637] [ 1]
> >
> /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x2936)[0x7646d936]
> > [str957-bl0-03:19637] [ 2]
> >
> /usr/lib/x86_64-linux-gnu/libmca_common_dstore.so.1(pmix_common_dstor_init+0x9d3)[0x76444733]
> > [str957-bl0-03:19637] [ 3]
> >
> /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x25b4)[0x7646d5b4]
> > [str957-bl0-03:19637] [ 4]
> >
> /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_gds_base_select+0x12e)[0x7659346e]
> > [str957-bl0-03:19637] [ 5]
> >
> /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_rte_init+0x8cd)[0x7654b88d]
> > [str957-bl0-03:19637] [ 6]
> > /usr/lib/x86_64-linux-gnu/libpmix.so.2(PMIx_Init+0xdc)[0x76507d7c]
> > [str957-bl0-03:19637] [ 7]
> >
> /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so(ext2x_client_init+0xc4)[0x76603fe4]
> > [str957-bl0-03:19637] [ 8]
> >
> /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so(+0x2656)[0x77fb1656]
> > [str957-bl0-03:19637] [ 9]
> >
> /usr/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_init+0x29a)[0x77c1c11a]
> > [str957-bl0-03:19637] [10]
> >
> /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x252)[0x77eece62]
> > [str957-bl0-03:19637] [11]
> > /usr/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Init+0x6e)[0x77f1b17e]
> > [str957-bl0-03:19637] [12] ./mpitest-debug(+0x11c6)[0x51c6]
> > [str957-bl0-03:19637] [13]
> > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x77ce309b]
> > [str957-bl0-03:19637] [14] ./mpitest-debug(+0x10da)[0x50da]
> > [str957-bl0-03:19637] *** End of error message ***
> > -8<--
> >
> > I'm using Debian stable packages. On other servers there is no problem
> > (but there was in the past, and it got "solved" by just installing gdb).
> >
> > Any hints?
> >
> > TIA
> >
> > --
> > Diego Zuccato
> > DIFA - Dip. di Fisica e Astronomia
> > Servizi Informatici
> > Alma Mater Studiorum - Università di Bologna
> > V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> > tel.: +39 051 20 95786
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
>


Re: [OMPI users] MPI is still dominant paradigm?

2020-08-07 Thread Gus Correa via users
"The reports of MPI death are greatly exaggerated." [Mark Twain]

And so are the reports of Fortran death
(despite the efforts of many CS departments
to make their students Fortran- and C-illiterate).

IMHO the level of abstraction of MPI is adequate, and actually very well
designed.
Higher levels often make things too specific, less applicable to generic
code,
while dispersing into a plethora of niche libraries and modules
that are bound to become unmaintained, buggy, and obsolete.
(Where Python seems to be headed, following Perl.)

How many people speak Chinese?
How many people speak English?
How many people speak Spanish?
How many people speak Esperanto?



On Fri, Aug 7, 2020 at 8:28 AM Oddo Da via users 
wrote:

> Hello,
>
> This may be a bit of a longer post and I am not sure if it is even
> appropriate here but I figured I ask. There are no hidden agendas in it, so
> please treat it as "asking for opinions/advice", as opposed to judging or
> provoking.
>
> For the period between 2010 to 2017 I used to work in (buzzword alert!)
> "big data" (meaning Spark, HDFS, reactive stuff like Akka) but way before
> that in the early 2000s I used to write basic multithreaded C and some MPI
> code. I came back to HPC/academia two years ago and what struck me was that
> (for lack of better word) the field is still "stuck" (again, for lack of
> better word) on MPI. This itself may seem negative in this context,
> however, I am just stating my observation, which may be wrong.
>
> I like low level programming and I like being in control of what is going
> on but having had the experience in Spark and Akka, I kind of got spoiled.
> Yes, I understand that the latter has fault-tolerance (which is nice) and
> MPI doesn't (or at least, didn't when I played with in 1999-2005) but I
> always felt like MPI needed higher level abstractions as a CHOICE (not
> _only_ choice) laid over the bare metal offerings. The whole world has
> moved onto programming in patterns and higher level abstractions, why is
> the academic/HPC world stuck on bare metal, still? Yes, I understand that
> performance often matters and the higher up you go, the more performance
> loss you incur, however, there is also something to be said about developer
> time and ease of understanding/abstracting etc. etc.
>
> Be that as it may, I am working on a project now in the HPC world and I
> noticed that Open MPI has Java bindings (or should I say "interface"?).
> What is the state of those? Which JDK do they support? Most importantly,
> would it be a HUGE pipe dream to think about building patterns a-la Akka
> (or even mixing actual Akka implementation) on top of OpenMPI via this Java
> bridge? What would be involved on the OpenMPI side? I have time/interest in
> going this route if there would be any hope of coming up with something
> that would make my life (and future people coming into HPC/MPI) easier in
> terms of building applications. I am not saying MPI in C/C++/Fortran should
> go away, however, sometimes we don't need the low-level stuff to express a
> concept :-). It may also open a whole new world for people on large
> clusters...
>
> Thank you!
>


Re: [OMPI users] Moving an installation

2020-07-24 Thread Gus Correa via users
+1
In my experience moving software, especially something of the complexity of
(Open) MPI,
is much more troublesome (and often just useless frustration) and time
consuming than recompiling it.
Hardware, OS, kernel, libraries, etc, are unlikely to be compatible.
Gus Correa

On Fri, Jul 24, 2020 at 1:03 PM Ralph Castain via users <
users@lists.open-mpi.org> wrote:

> While possible, it is highly unlikely that your desktop version is going
> to be binary compatible with your cluster...
>
> On Jul 24, 2020, at 9:55 AM, Lana Deere via users <
> users@lists.open-mpi.org> wrote:
>
> I have open-mpi 4.0.4 installed on my desktop and my small test programs
> are working.
>
> I would like to migrate the open-mpi to a cluster and run a larger program
> there.  When moved, the open-mpi installation is in a different pathname
> than it was on my desktop and it doesn't seem to work any longer.  I can
> make the libraries visible via LD_LIBRARY_PATH but this seems
> insufficient.  Is there an environment variable which can be used to tell
> the open-mpi where it is installed?
>
> Is it mandatory to actually compile the release in the ultimate
> destination on each system where it will be used?
>
> Thanks.
>
> .. Lana (lana.de...@gmail.com)
>
>
>
>


Re: [OMPI users] mca_oob_tcp_recv_handler: invalid message type: 15

2019-12-10 Thread Gus Correa via users
Hi Guido

Your PATH and LD_LIBRARY_PATH seem to be inconsistent with each other:

PATH=$HOME/libraries/compiled_with_gcc-7.3.0/openmpi-4.0.2/bin:$PATH
LD_LIBRARY_PATH=/share/apps/gcc-7.3.0/lib64:$LD_LIBRARY_PATH
Hence, you may be mixing different versions of Open MPI.

It looks like you installed Open MPI 4.0.2 here:
/home/guido/libraries/compiled_with_gcc-7.3.0/openmpi-4.0.2/

Have you tried this instead?
LD_LIBRARY_PATH=$HOME/libraries/compiled_with_gcc-7.3.0/openmpi-4.0.2/lib:$LD_LIBRARY_PATH
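
Put differently, the relevant lines of your PBS script might become
(a sketch based on the paths you posted):

  PATH=$HOME/libraries/compiled_with_gcc-7.3.0/openmpi-4.0.2/bin:$PATH
  LD_LIBRARY_PATH=$HOME/libraries/compiled_with_gcc-7.3.0/openmpi-4.0.2/lib:/share/apps/gcc-7.3.0/lib64:$LD_LIBRARY_PATH
  export PATH LD_LIBRARY_PATH
  which mpirun                # should point at the 4.0.2 install
  ldd ./flash4 | grep libmpi  # should resolve to .../openmpi-4.0.2/lib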

I hope this helps,
Gus Correa

On Tue, Dec 10, 2019 at 4:40 PM Guido granda muñoz via users <
users@lists.open-mpi.org> wrote:

> Hello,
> I compiled the application now using openmpi-4.0.2:
>
>  linux-vdso.so.1 =>  (0x7fffb23ff000)
> libhdf5.so.103 =>
> /home/guido/libraries/compiled_with_gcc-7.3.0/hdf5-1.10.5_serial/lib/libhdf5.so.103
> (0x2b3cd188c000)
> libz.so.1 => /lib64/libz.so.1 (0x2b3cd1e74000)
> libmpi_usempif08.so.40 =>
> /home/guido/libraries/compiled_with_gcc-7.3.0/openmpi-4.0.2/lib/libmpi_usempif08.so.40
> (0x2b3cd208a000)
> libmpi_usempi_ignore_tkr.so.40 =>
> /home/guido/libraries/compiled_with_gcc-7.3.0/openmpi-4.0.2/lib/libmpi_usempi_ignore_tkr.so.40
> (0x2b3cd22c)
> libmpi_mpifh.so.40 =>
> /home/guido/libraries/compiled_with_gcc-7.3.0/openmpi-4.0.2/lib/libmpi_mpifh.so.40
> (0x2b3cd24c7000)
> libmpi.so.40 =>
> /home/guido/libraries/compiled_with_gcc-7.3.0/openmpi-4.0.2/lib/libmpi.so.40
> (0x2b3cd2723000)
> libgfortran.so.4 => /share/apps/gcc-7.3.0/lib64/libgfortran.so.4
> (0x2b3cd2a55000)
> libm.so.6 => /lib64/libm.so.6 (0x2b3cd2dc3000)
> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x2b3cd3047000)
> libquadmath.so.0 => /share/apps/gcc-5.4.0/lib64/libquadmath.so.0
> (0x2b3cd325e000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x2b3cd349c000)
> libc.so.6 => /lib64/libc.so.6 (0x2b3cd36b9000)
> librt.so.1 => /lib64/librt.so.1 (0x2b3cd3a4e000)
> libdl.so.2 => /lib64/libdl.so.2 (0x2b3cd3c56000)
> libopen-rte.so.40 =>
> /home/guido/libraries/compiled_with_gcc-7.3.0/openmpi-4.0.2/lib/libopen-rte.so.40
> (0x2b3cd3e5b000)
> libopen-pal.so.40 =>
> /home/guido/libraries/compiled_with_gcc-7.3.0/openmpi-4.0.2/lib/libopen-pal.so.40
> (0x2b3cd411)
> libudev.so.0 => /lib64/libudev.so.0 (0x2b3cd4425000)
> libutil.so.1 => /lib64/libutil.so.1 (0x2b3cd4634000)
> /lib64/ld-linux-x86-64.so.2 (0x2b3cd166a000)
>
> and ran it like this:
>
> #!/bin/bash
> #PBS -l nodes=1:ppn=32
> #PBS -N mc_cond_0_h3
> #PBS -o mc_cond_0_h3.o
> #PBS -e mc_cond_0_h3.e
>
> PATH=$HOME/libraries/compiled_with_gcc-7.3.0/openmpi-4.0.2/bin:$PATH
> LD_LIBRARY_PATH=/share/apps/gcc-7.3.0/lib64:$LD_LIBRARY_PATH
> cd $PBS_O_WORKDIR
> mpirun -np 32 ./flash4
>
> and now I'm getting these error messages:
>
> --
>
> As of version 3.0.0, the "sm" BTL is no longer available in Open MPI.
>
>
> Efficient, high-speed same-node shared memory communication support in
>
> Open MPI is available in the "vader" BTL. To use the vader BTL, you
>
> can re-run your job with:
>
>
> mpirun --mca btl vader,self,... your_mpi_application
>
> --
>
> --
>
> A requested component was not found, or was unable to be opened. This
>
> means that this component is either not installed or is unable to be
>
> used on your system (e.g., sometimes this means that shared libraries
>
> that the component requires are unable to be found/loaded). Note that
>
> Open MPI stopped checking at the first component that it did not find.
>
>
> Host: compute-0-34.local
>
> Framework: btl
>
> Component: sm
>
> --
>
> --
>
> It looks like MPI_INIT failed for some reason; your parallel process is
>
> likely to abort. There are many reasons that a parallel process can
>
> fail during MPI_INIT; some of which are due to configuration or environment
>
> problems. This failure appears to be an internal failure; here's some
>
> additional information (which may only be relevant to an Open MPI
>
> developer):
>
>
> mca_bml_base_open() failed
>
> --> Returned "Not found" (-13) instead of "Success" (0)
>
> --
>
> [compute-0-34:16915] *** An error occurred in MPI_Init
>
> [compute-0-34:16915] *** reported by process [3776708609,5]
>
> [compute-0-34:16915] *** on a NULL communicator
>
> [compute-0-34:16915] *** Unknown error
>
> [compute-0-34:16915] *** MPI_ERRORS_ARE_FATAL (processes in this
> communicator will now abort,
>
> [compute-0-34:16915] *** and potentially your MPI job)
>
> [compute-0-34.local:16902] PMIX ERROR: UNREACHABLE in file
> server/pmix_server.c at line 2147
>
> [compute-0-34.local:16902]