Re: [OMPI users] help with M1 chip macOS openMPI installation

2022-04-22 Thread George Bosilca via users
I think you should work under the assumption that you are cross-compiling,
because the target architecture for the OMPI build should be x86 and not the
local architecture. It's been a while since I last cross-compiled, but I hear
Gilles does cross-compilation routinely, so he might be able to help.
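
Something along these lines is what I have in mind (an untested sketch; the
--host triple, install prefix and compiler settings are purely illustrative):

  # Run the whole build (configure tests included) as x86_64 under Rosetta,
  # and tell configure explicitly that the target is x86_64.
  arch -x86_64 ./configure --prefix=/opt/openmpi \
      --host=x86_64-apple-darwin \
      CC=icc CXX=icc F77=ifort FC=ifort
  arch -x86_64 make all install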

  George.


On Fri, Apr 22, 2022 at 13:14 Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> Can you send all the information listed under "For compile problems"
> (please compress!):
>
> https://www.open-mpi.org/community/help/
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> 
> From: users  on behalf of Cici Feng via
> users 
> Sent: Friday, April 22, 2022 5:30 AM
> To: Open MPI Users
> Cc: Cici Feng
> Subject: Re: [OMPI users] help with M1 chip macOS openMPI installation
>
> Hi George,
>
> Thanks so much for the tips. I have installed Rosetta so that my computer
> can run the Intel software. However, the same error appears when I try to
> build OMPI with make, and here is how it looks:
>
> ../../../../opal/threads/thread_usage.h(163): warning #266: function
> "opal_atomic_swap_ptr" declared implicitly
>
>   OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)
>
>   ^
>
>
> In file included from ../../../../opal/class/opal_object.h(126),
>
>  from ../../../../opal/dss/dss_types.h(40),
>
>  from ../../../../opal/dss/dss.h(32),
>
>  from pmix3x_server_north.c(27):
>
> ../../../../opal/threads/thread_usage.h(163): warning #120: return value
> type does not match the function type
>
>   OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)
>
>   ^
>
>
> pmix3x_server_north.c(157): warning #266: function "opal_atomic_rmb"
> declared implicitly
>
>   OPAL_ACQUIRE_OBJECT(opalcaddy);
>
>   ^
>
>
>   CCLD mca_pmix_pmix3x.la
>
> Making all in mca/pstat/test
>
>   CCLD mca_pstat_test.la
>
> Making all in mca/rcache/grdma
>
>   CCLD mca_rcache_grdma.la
>
> Making all in mca/reachable/weighted
>
>   CCLD mca_reachable_weighted.la
>
> Making all in mca/shmem/mmap
>
>   CCLD mca_shmem_mmap.la
>
> Making all in mca/shmem/posix
>
>   CCLD mca_shmem_posix.la
>
> Making all in mca/shmem/sysv
>
>   CCLD mca_shmem_sysv.la
>
> Making all in tools/wrappers
>
>   CCLD opal_wrapper
>
> Undefined symbols for architecture x86_64:
>
>   "_opal_atomic_add_fetch_32", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_compare_exchange_strong_32", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_compare_exchange_strong_ptr", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_lock", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_lock_init", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_mb", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_rmb", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_sub_fetch_32", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_swap_32", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_swap_ptr", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_unlock", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_wmb", referenced from:
>
>   import-atom in libopen-pal.dylib
>
> ld: symbol(s) not found for architecture x86_64
>
> make[2]: *** [opal_wrapper] Error 1
>
> make[1]: *** [all-recursive] Error 1
>
> make: *** [all-recursive] Error 1
>
>
> I am not sure whether the ld part affects the build process or not. Either
> way, Error 1 appears at "opal_wrapper", which I think is the same error I
> kept encountering.
>
> Is there any explanation for this specific error?
>
> P.S. The configure command I used is as follows, provided by the official
> website of MARE2DEM:
>
> sudo  ./configure --prefix=/opt/openmpi CC=icc CXX=icc F77=ifort FC=ifort \
> lt_prog_compiler_wl_FC='-Wl,';
> make all install
>
> Thanks again,
> Cici
>
> On Thu, Apr 21, 2022 at 11:18 PM George Bosilca via users <
> users@lists.open-mpi.org> wrote:
> 1. I am not aware of any outstanding OMPI issues with the M1 chip that
> would prevent OMPI from compiling and running efficiently in an M1-based
> setup, assuming the compilation chain is working properly.
>
> 2. M1 supports x86 code via Rosetta, an app provided by Apple to ensure a
> smooth transition from the Intel-based to the M1-based laptop line. I do
> recall running an OMPI compiled on my Intel laptop on my M1 laptop to test
> the performance of the Rosetta binary translator. We even had 

Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-22 Thread Angel de Vicente via users
Hello,

"Keller, Rainer"  writes:

> You’re using MPI_Probe() with Threads; that’s not safe.
> Please consider using MPI_Mprobe() together with MPI_Mrecv().

Many thanks for the suggestion. I will try the M variants, though I was
under the impression that mpi_probe() was OK as long as one made sure that
the source and tag matched between the mpi_probe() and the mpi_recv() calls.

As you can see below, I'm careful with that (in any case, I'm not sure the
problem lies there, since the error I get is an invalid memory reference in
the mpi_probe call itself).

,
|tid = 0  
| #ifdef _OPENMP  
|tid = omp_get_thread_num()   
| #endif  
| 
|do   
|   if (tid == 0) then
|  call mpi_send(my_rank, 1, mpi_integer, master, ask_job, &  
|   mpi_comm_world, mpierror) 
|  call mpi_probe(master,mpi_any_tag,mpi_comm_world,stat,mpierror)
| 
|  if (stat(mpi_tag) == stop_signal) then 
| call mpi_recv(b_,1,mpi_integer,master,stop_signal, &
|  mpi_comm_world,stat,mpierror)  
|  else   
| call mpi_recv(iyax,1,mpi_integer,master,give_job, & 
|  mpi_comm_world,stat,mpierror)  
|  end if 
|   end if
| 
|   !$omp barrier
| 
|   [... actual work...]
`


> So valgrind may be of help here, possibly after recompiling Open MPI with
> valgrind checking and debugging options enabled.

I was hoping to avoid this route, but it certainly is looking like I'll
have to bite the bullet...

Thanks,
-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/


Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-22 Thread Angel de Vicente via users
Hello Jeff,

"Jeff Squyres (jsquyres)"  writes:

> With THREAD_FUNNELED, it means that there can only be one thread in
> MPI at a time -- and it needs to be the same thread as the one that
> called MPI_INIT_THREAD.
>
> Is that the case in your app?


The master rank (i.e. 0) never creates threads, while the other ranks go
through the following to communicate with it, so I check that it is indeed
only the master thread that communicates:

,
|tid = 0  
| #ifdef _OPENMP  
|tid = omp_get_thread_num()   
| #endif  
| 
|do   
|   if (tid == 0) then
|  call mpi_send(my_rank, 1, mpi_integer, master, ask_job, &  
|   mpi_comm_world, mpierror) 
|  call mpi_probe(master,mpi_any_tag,mpi_comm_world,stat,mpierror)
| 
|  if (stat(mpi_tag) == stop_signal) then 
| call mpi_recv(b_,1,mpi_integer,master,stop_signal, &
|  mpi_comm_world,stat,mpierror)  
|  else   
| call mpi_recv(iyax,1,mpi_integer,master,give_job, & 
|  mpi_comm_world,stat,mpierror)  
|  end if 
|   end if
| 
|   !$omp barrier
| 
|   [... actual work...]
`


> Also, what is your app doing at src/pcorona_main.f90:627?

It is the mpi_probe call above.


In case it clarifies things, my app follows a master-worker paradigm, where
rank 0 hands out jobs, and all MPI ranks > 0 just do the following:

,
| !$OMP PARALLEL DEFAULT(NONE)
| do
|   !  (the code above) 
|   if (tid == 0) then receive job number | stop signal
|  
|   !$OMP DO schedule(dynamic)
|   loop_izax: do izax=sol_nz_min,sol_nz_max
| 
|  [big computing loop body]
| 
|   end do loop_izax  
|   !$OMP END DO  
| 
|   if (tid == 0) then 
|   call mpi_send(iyax,1,mpi_integer,master,results_tag, & 
|mpi_comm_world,mpierror)  
|   call mpi_send(stokes_buf_y,nz*8,mpi_double_precision, &
|master,results_tag,mpi_comm_world,mpierror)   
|   end if 
|  
|   !omp barrier   
|  
| end do   
| !$OMP END PARALLEL  
`



Following Gilles' suggestion, I also tried changing MPI_THREAD_FUNNELED to
MPI_THREAD_MULTIPLE just in case, but I get the same segmentation fault at
the same line (mind you, the segmentation fault doesn't happen every time).
But again, there are no issues when running with --bind-to socket (and no
apparent issues at all on the other computer, even with --bind-to none).

Many thanks for any suggestions,
-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/


Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-22 Thread Keller, Rainer via users
Dear Angel,
You’re using MPI_Probe() with Threads; that’s not safe.
Please consider using MPI_Mprobe() together with MPI_Mrecv().
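
Roughly like this (an untested sketch only, reusing the names from your
snippet; with the mpi module the message handle is a plain integer):

   integer :: msg_handle   ! hypothetical name for the MPI message handle

   ! The matched probe removes the message from the matching queue and hands
   ! back a handle, so no other thread can intercept it between the probe
   ! and the receive.
   call mpi_mprobe(master, mpi_any_tag, mpi_comm_world, msg_handle, stat, mpierror)

   if (stat(mpi_tag) == stop_signal) then
      call mpi_mrecv(b_, 1, mpi_integer, msg_handle, stat, mpierror)
   else
      call mpi_mrecv(iyax, 1, mpi_integer, msg_handle, stat, mpierror)
   end if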

However, you mention running with only one thread (setting OMP_NUM_THREADS=1),
assuming you didn't override that with omp_set_num_threads() or a
num_threads() clause…

So valgrind may be of help here, possibly after recompiling Open MPI with
valgrind checking and debugging options enabled.
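
For example (untested; install prefix, valgrind path and binary name are only
illustrative, the configure options are Open MPI's standard ones):

   # Rebuild Open MPI with debug info and valgrind (memchecker) support
   ./configure --prefix=$HOME/ompi-dbg --enable-debug \
       --enable-memchecker --with-valgrind=/usr
   make -j 4 install

   # Then run the application under valgrind
   $HOME/ompi-dbg/bin/mpirun -np 4 --bind-to none \
       valgrind --track-origins=yes ./pcorona+openmp~gauss Fe13_NL3.params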

Best regards,
Rainer


> On 22. Apr 2022, at 14:40, Angel de Vicente via users 
>  wrote:
> 
> Hello,
> 
> I'm running out of ideas, and wonder if someone here could have some
> tips on how to debug a segmentation fault I'm having with my
> application [due to the nature of the problem I'm wondering if the
> problem is with OpenMPI itself rather than my app, though at this point
> I'm not leaning strongly either way].
> 
> The code is hybrid MPI+OpenMP and I compile it with gcc 10.3.0 and
> OpenMPI 4.1.3.
> 
> Usually I was running the code with "mpirun -np X --bind-to none [...]"
> so that the threads created by OpenMP don't get bound to a single core
> and I actually get proper speedup out of OpenMP.
> 
> Now, since I introduced some changes to the code this week (though I have
> read the changes carefully a number of times and I don't see anything
> suspicious), I sometimes get a segmentation fault, but only when I run with
> "--bind-to none" and only on my workstation. It is not always with the same
> run configuration, but I can see a pattern: the problem shows up only if I
> run the version compiled with OpenMP support, and most of the time only when
> the number of ranks*threads goes above 4 or so. If I run with "--bind-to
> socket" all looks good all the time.
> 
> If I run it on another server, "--bind-to none" doesn't seem to be an issue
> (I submitted the jobs many, many times and not a single segmentation fault),
> but on my workstation it fails almost every time when using MPI+OpenMP with
> a handful of threads and "--bind-to none". On this other server I'm running
> gcc 9.3.0 and OpenMPI 4.1.3.
> 
> For example, setting OMP_NUM_THREADS to 1, I run the code as follows and
> get the segmentation fault shown below:
> 
> ,
> | angelv@sieladon:~/.../Fe13_NL3/t~gauss+isat+istim$ mpirun -np 4 --bind-to 
> none  ../../../../../pcorona+openmp~gauss Fe13_NL3.params 
> |  Reading control file: Fe13_NL3.params
> |   ... Control file parameters broadcasted
> | 
> | [...]
> |  
> |  Starting calculation loop on the line of sight
> |  Receiving results from:2
> |  Receiving results from:1
> | 
> | Program received signal SIGSEGV: Segmentation fault - invalid memory 
> reference.
> | 
> | Backtrace for this error:
> |  Receiving results from:3
> | #0  0x7fd747e7555f in ???
> | #1  0x7fd7488778e1 in ???
> | #2  0x7fd7488667a4 in ???
> | #3  0x7fd7486fe84c in ???
> | #4  0x7fd7489aa9ce in ???
> | #5  0x414959 in __pcorona_main_MOD_main_loop._omp_fn.0
> | at src/pcorona_main.f90:627
> | #6  0x7fd74813ec75 in ???
> | #7  0x412bb0 in pcorona
> | at src/pcorona.f90:49
> | #8  0x40361c in main
> | at src/pcorona.f90:17
> | 
> | [...]
> | 
> | --
> | mpirun noticed that process rank 3 with PID 0 on node sieladon exited on 
> signal 11 (Segmentation fault).
> | ---
> `
> 
> I cannot see inside the MPI library (I don't really know if that would
> be helpful) but line 627 in pcorona_main.f90 is:
> 
> ,
> |  call mpi_probe(master,mpi_any_tag,mpi_comm_world,stat,mpierror)
> `
> 
> Any ideas/suggestions as to what could be going on, or how to try and get
> some more clues about the possible causes of this?
> 
> Many thanks,
> -- 
> Ángel de Vicente
> 
> Tel.: +34 922 605 747
> Web.: http://research.iac.es/proyecto/polmag/

-
Prof. Dr.-Ing. Rainer Keller, HS Esslingen
Program Coordinator, M.Sc. Applied Computer Science
Professor of Operating Systems, Distributed and Parallel Systems
Faculty of Computer Science and 

Re: [OMPI users] help with M1 chip macOS openMPI installation

2022-04-22 Thread Jeff Squyres (jsquyres) via users
Can you send all the information listed under "For compile problems" (please 
compress!):

https://www.open-mpi.org/community/help/

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Cici Feng via users 

Sent: Friday, April 22, 2022 5:30 AM
To: Open MPI Users
Cc: Cici Feng
Subject: Re: [OMPI users] help with M1 chip macOS openMPI installation

Hi George,

Thanks so much for the tips. I have installed Rosetta so that my computer can
run the Intel software. However, the same error appears when I try to build
OMPI with make, and here is how it looks:

../../../../opal/threads/thread_usage.h(163): warning #266: function 
"opal_atomic_swap_ptr" declared implicitly

  OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)

  ^


In file included from ../../../../opal/class/opal_object.h(126),

 from ../../../../opal/dss/dss_types.h(40),

 from ../../../../opal/dss/dss.h(32),

 from pmix3x_server_north.c(27):

../../../../opal/threads/thread_usage.h(163): warning #120: return value type 
does not match the function type

  OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)

  ^


pmix3x_server_north.c(157): warning #266: function "opal_atomic_rmb" declared 
implicitly

  OPAL_ACQUIRE_OBJECT(opalcaddy);

  ^


  CCLD mca_pmix_pmix3x.la

Making all in mca/pstat/test

  CCLD mca_pstat_test.la

Making all in mca/rcache/grdma

  CCLD mca_rcache_grdma.la

Making all in mca/reachable/weighted

  CCLD mca_reachable_weighted.la

Making all in mca/shmem/mmap

  CCLD mca_shmem_mmap.la

Making all in mca/shmem/posix

  CCLD mca_shmem_posix.la

Making all in mca/shmem/sysv

  CCLD mca_shmem_sysv.la

Making all in tools/wrappers

  CCLD opal_wrapper

Undefined symbols for architecture x86_64:

  "_opal_atomic_add_fetch_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_compare_exchange_strong_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_compare_exchange_strong_ptr", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_lock", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_lock_init", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_mb", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_rmb", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_sub_fetch_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_swap_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_swap_ptr", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_unlock", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_wmb", referenced from:

  import-atom in libopen-pal.dylib

ld: symbol(s) not found for architecture x86_64

make[2]: *** [opal_wrapper] Error 1

make[1]: *** [all-recursive] Error 1

make: *** [all-recursive] Error 1


I am not sure whether the ld part affects the build process or not. Either
way, Error 1 appears at "opal_wrapper", which I think is the same error I kept
encountering.

Is there any explanation for this specific error?

P.S. The configure command I used is as follows, provided by the official
website of MARE2DEM:

sudo  ./configure --prefix=/opt/openmpi CC=icc CXX=icc F77=ifort FC=ifort \
lt_prog_compiler_wl_FC='-Wl,';
make all install

Thanks again,
Cici

On Thu, Apr 21, 2022 at 11:18 PM George Bosilca via users
<users@lists.open-mpi.org> wrote:
1. I am not aware of any outstanding OMPI issues with the M1 chip that would 
prevent OMPI from compiling and running efficiently in an M1-based setup, 
assuming the compilation chain is working properly.

2. M1 supports x86 code via Rosetta, an app provided by Apple to ensure a 
smooth transition from the Intel-based to the M1-based laptop line. I do 
recall running an OMPI compiled on my Intel laptop on my M1 laptop to test the 
performance of the Rosetta binary translator. We even had some discussions 
about this, on the mailing list (or github issues).

3. Based on your original message, and their webpage, MARE2DEM does not
support any compilation chain other than Intel's. As explained above, that
might not be a showstopper by itself, because you can run x86 code on the M1
chip using Rosetta. However, MARE2DEM relies on MKL, the Intel Math Kernel
Library, and that library will not run on an M1 chip.

  George.


On Thu, Apr 21, 2022 at 7:02 AM Jeff Squyres (jsquyres) via users
<users@lists.open-mpi.org> wrote:
A little more color on Gilles' answer: I believe that we had some Open MPI 
community members work on adding M1 support to Open MPI, but 

Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-22 Thread Jeff Squyres (jsquyres) via users
With THREAD_FUNNELED, it means that there can only be one thread in MPI at a 
time -- and it needs to be the same thread as the one that called 
MPI_INIT_THREAD.

Is that the case in your app?
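
For reference, a minimal (untested) sketch of what the FUNNELED discipline
looks like in a hybrid code -- this is not your application's code, just an
illustration of "only the thread that initialized MPI makes MPI calls":

  program funneled_sketch
    use mpi
    implicit none
    integer :: provided, rank, token, mpierror

    ! The initial thread requests FUNNELED support and is the only thread
    ! that is allowed to call MPI afterwards.
    call mpi_init_thread(MPI_THREAD_FUNNELED, provided, mpierror)
    call mpi_comm_rank(mpi_comm_world, rank, mpierror)
    token = rank

    !$omp parallel default(shared)
    !$omp master
    ! Only the master (initial) thread touches MPI here.
    call mpi_bcast(token, 1, mpi_integer, 0, mpi_comm_world, mpierror)
    !$omp end master
    !$omp barrier
    ! ... all threads may now use 'token' ...
    !$omp end parallel

    call mpi_finalize(mpierror)
  end program funneled_sketch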

Also, what is your app doing at src/pcorona_main.f90:627?  Is it making a call
to MPI, or something else?  It might be useful to compile Open MPI (and/or
other libraries that you're using) with -g so that you can get more meaningful
stack traces upon error -- that might give some insight into where / why the
failure is occurring.
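
Untested sketch of what I mean (prefix and flags purely illustrative);
--enable-debug is the standard Open MPI way to get a debug build, and the
application itself should also be built with -g (and, for gfortran,
-fbacktrace) so its frames resolve to file:line instead of "???":

  # Debug build of Open MPI itself (illustrative prefix)
  ./configure --prefix=$HOME/ompi-debug --enable-debug CFLAGS="-g -O0"
  make -j 4 install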

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Angel de Vicente 
via users 
Sent: Friday, April 22, 2022 10:54 AM
To: Gilles Gouaillardet via users
Cc: Angel de Vicente
Subject: Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation 
fault only when run with --bind-to none

Thanks Gilles,

Gilles Gouaillardet via users  writes:

> You can first double check that you call
> MPI_Init_thread(..., MPI_THREAD_MULTIPLE, ...)

my code uses "mpi_thread_funneled" and OpenMPI was compiled with
MPI_THREAD_MULTIPLE support:

,
| ompi_info | grep  -i thread
|   Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, 
OMPI progress: no, ORTE progress: yes, Event lib: yes)
|FT Checkpoint support: no (checkpoint thread: no)
`

Cheers,
--
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/


Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-22 Thread Angel de Vicente via users
Thanks Gilles,

Gilles Gouaillardet via users  writes:

> You can first double check that you call
> MPI_Init_thread(..., MPI_THREAD_MULTIPLE, ...)

my code uses "mpi_thread_funneled" and OpenMPI was compiled with
MPI_THREAD_MULTIPLE support:

,
| ompi_info | grep  -i thread
|   Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, 
OMPI progress: no, ORTE progress: yes, Event lib: yes)
|FT Checkpoint support: no (checkpoint thread: no)
`

Cheers,
-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/


Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-22 Thread Gilles Gouaillardet via users
You can first double check that you call
MPI_Init_thread(..., MPI_THREAD_MULTIPLE, ...)
and that the provided level is MPI_THREAD_MULTIPLE, as you requested.
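
Something like this (a minimal untested sketch, not your actual code):

  program thread_level_check
    use mpi
    implicit none
    integer :: provided, mpierror

    call mpi_init_thread(MPI_THREAD_MULTIPLE, provided, mpierror)
    ! The thread levels are ordered, so a simple comparison is enough.
    if (provided < MPI_THREAD_MULTIPLE) then
       print *, 'Requested MPI_THREAD_MULTIPLE but only got level ', provided
       call mpi_abort(mpi_comm_world, 1, mpierror)
    end if
    ! ... rest of the application ...
    call mpi_finalize(mpierror)
  end program thread_level_check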

Cheers,

Gilles

On Fri, Apr 22, 2022, 21:45 Angel de Vicente via users <
users@lists.open-mpi.org> wrote:

> Hello,
>
> I'm running out of ideas, and wonder if someone here could have some
> tips on how to debug a segmentation fault I'm having with my
> application [due to the nature of the problem I'm wondering if the
> problem is with OpenMPI itself rather than my app, though at this point
> I'm not leaning strongly either way].
>
> The code is hybrid MPI+OpenMP and I compile it with gcc 10.3.0 and
> OpenMPI 4.1.3.
>
> Usually I was running the code with "mpirun -np X --bind-to none [...]"
> so that the threads created by OpenMP don't get bound to a single core
> and I actually get proper speedup out of OpenMP.
>
> Now, since I introduced some changes to the code this week (though I have
> read the changes carefully a number of times and I don't see anything
> suspicious), I sometimes get a segmentation fault, but only when I run with
> "--bind-to none" and only on my workstation. It is not always with the same
> run configuration, but I can see a pattern: the problem shows up only if I
> run the version compiled with OpenMP support, and most of the time only when
> the number of ranks*threads goes above 4 or so. If I run with "--bind-to
> socket" all looks good all the time.
>
> If I run it on another server, "--bind-to none" doesn't seem to be an issue
> (I submitted the jobs many, many times and not a single segmentation fault),
> but on my workstation it fails almost every time when using MPI+OpenMP with
> a handful of threads and "--bind-to none". On this other server I'm running
> gcc 9.3.0 and OpenMPI 4.1.3.
>
> For example, setting OMP_NUM_THREADS to 1, I run the code as follows and
> get the segmentation fault shown below:
>
> ,
> | angelv@sieladon:~/.../Fe13_NL3/t~gauss+isat+istim$ mpirun -np 4
> --bind-to none  ../../../../../pcorona+openmp~gauss Fe13_NL3.params
> |  Reading control file: Fe13_NL3.params
> |   ... Control file parameters broadcasted
> |
> | [...]
> |
> |  Starting calculation loop on the line of sight
> |  Receiving results from:2
> |  Receiving results from:1
> |
> | Program received signal SIGSEGV: Segmentation fault - invalid memory
> reference.
> |
> | Backtrace for this error:
> |  Receiving results from:3
> | #0  0x7fd747e7555f in ???
> | #1  0x7fd7488778e1 in ???
> | #2  0x7fd7488667a4 in ???
> | #3  0x7fd7486fe84c in ???
> | #4  0x7fd7489aa9ce in ???
> | #5  0x414959 in __pcorona_main_MOD_main_loop._omp_fn.0
> | at src/pcorona_main.f90:627
> | #6  0x7fd74813ec75 in ???
> | #7  0x412bb0 in pcorona
> | at src/pcorona.f90:49
> | #8  0x40361c in main
> | at src/pcorona.f90:17
> |
> | [...]
> |
> |
> --
> | mpirun noticed that process rank 3 with PID 0 on node sieladon exited on
> signal 11 (Segmentation fault).
> | ---
> `
>
> I cannot see inside the MPI library (I don't really know if that would
> be helpful) but line 627 in pcorona_main.f90 is:
>
> ,
> |  call mpi_probe(master,mpi_any_tag,mpi_comm_world,stat,mpierror)
> `
>
> Any ideas/suggestions as to what could be going on, or how to try and get
> some more clues about the possible causes of this?
>
> Many thanks,
> --
> Ángel de Vicente
>
> Tel.: +34 922 605 747
> Web.: http://research.iac.es/proyecto/polmag/
>
>


[OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-22 Thread Angel de Vicente via users
Hello,

I'm running out of ideas, and wonder if someone here could have some
tips on how to debug a segmentation fault I'm having with my
application [due to the nature of the problem I'm wondering if the
problem is with OpenMPI itself rather than my app, though at this point
I'm not leaning strongly either way].

The code is hybrid MPI+OpenMP and I compile it with gcc 10.3.0 and
OpenMPI 4.1.3.

Usually I was running the code with "mpirun -np X --bind-to none [...]"
so that the threads created by OpenMP don't get bound to a single core
and I actually get proper speedup out of OpenMP.

Now, since I introduced some changes to the code this week (though I have
read the changes carefully a number of times and I don't see anything
suspicious), I sometimes get a segmentation fault, but only when I run with
"--bind-to none" and only on my workstation. It is not always with the same
run configuration, but I can see a pattern: the problem shows up only if I
run the version compiled with OpenMP support, and most of the time only when
the number of ranks*threads goes above 4 or so. If I run with "--bind-to
socket" all looks good all the time.

If I run it on another server, "--bind-to none" doesn't seem to be an issue
(I submitted the jobs many, many times and not a single segmentation fault),
but on my workstation it fails almost every time when using MPI+OpenMP with
a handful of threads and "--bind-to none". On this other server I'm running
gcc 9.3.0 and OpenMPI 4.1.3.

For example, setting OMP_NUM_THREADS to 1, I run the code as follows and get
the segmentation fault shown below:

,
| angelv@sieladon:~/.../Fe13_NL3/t~gauss+isat+istim$ mpirun -np 4 --bind-to 
none  ../../../../../pcorona+openmp~gauss Fe13_NL3.params 
|  Reading control file: Fe13_NL3.params
|   ... Control file parameters broadcasted
| 
| [...]
|  
|  Starting calculation loop on the line of sight
|  Receiving results from:2
|  Receiving results from:1
| 
| Program received signal SIGSEGV: Segmentation fault - invalid memory 
reference.
| 
| Backtrace for this error:
|  Receiving results from:3
| #0  0x7fd747e7555f in ???
| #1  0x7fd7488778e1 in ???
| #2  0x7fd7488667a4 in ???
| #3  0x7fd7486fe84c in ???
| #4  0x7fd7489aa9ce in ???
| #5  0x414959 in __pcorona_main_MOD_main_loop._omp_fn.0
| at src/pcorona_main.f90:627
| #6  0x7fd74813ec75 in ???
| #7  0x412bb0 in pcorona
| at src/pcorona.f90:49
| #8  0x40361c in main
| at src/pcorona.f90:17
| 
| [...]
| 
| --
| mpirun noticed that process rank 3 with PID 0 on node sieladon exited on 
signal 11 (Segmentation fault).
| ---
`

I cannot see inside the MPI library (I don't really know if that would
be helpful) but line 627 in pcorona_main.f90 is:

,
|  call mpi_probe(master,mpi_any_tag,mpi_comm_world,stat,mpierror)
`

Any ideas/suggestions as to what could be going on, or how to try and get
some more clues about the possible causes of this?

Many thanks,
-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/


Re: [OMPI users] help with M1 chip macOS openMPI installation

2022-04-22 Thread Cici Feng via users
Hi George,

Thanks so much for the tips. I have installed Rosetta so that my computer can
run the Intel software. However, the same error appears when I try to build
OMPI with make, and here is how it looks:

../../../../opal/threads/thread_usage.h(163): warning #266: function
"opal_atomic_swap_ptr" declared implicitly

  OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)

  ^


In file included from ../../../../opal/class/opal_object.h(126),

 from ../../../../opal/dss/dss_types.h(40),

 from ../../../../opal/dss/dss.h(32),

 from pmix3x_server_north.c(27):

../../../../opal/threads/thread_usage.h(163): warning #120: return value
type does not match the function type

  OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)

  ^


pmix3x_server_north.c(157): warning #266: function "opal_atomic_rmb"
declared implicitly

  OPAL_ACQUIRE_OBJECT(opalcaddy);

  ^


  CCLD mca_pmix_pmix3x.la

Making all in mca/pstat/test

  CCLD mca_pstat_test.la

Making all in mca/rcache/grdma

  CCLD mca_rcache_grdma.la

Making all in mca/reachable/weighted

  CCLD mca_reachable_weighted.la

Making all in mca/shmem/mmap

  CCLD mca_shmem_mmap.la

Making all in mca/shmem/posix

  CCLD mca_shmem_posix.la

Making all in mca/shmem/sysv

  CCLD mca_shmem_sysv.la

Making all in tools/wrappers

  CCLD opal_wrapper

Undefined symbols for architecture x86_64:

  "_opal_atomic_add_fetch_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_compare_exchange_strong_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_compare_exchange_strong_ptr", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_lock", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_lock_init", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_mb", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_rmb", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_sub_fetch_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_swap_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_swap_ptr", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_unlock", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_wmb", referenced from:

  import-atom in libopen-pal.dylib

ld: symbol(s) not found for architecture x86_64

make[2]: *** [opal_wrapper] Error 1

make[1]: *** [all-recursive] Error 1

make: *** [all-recursive] Error 1


I am not sure whether the ld part affects the build process or not. Either
way, Error 1 appears at "opal_wrapper", which I think is the same error I kept
encountering.

Is there any explanation for this specific error?

P.S. The configure command I used is as follows, provided by the official
website of MARE2DEM:

sudo  ./configure --prefix=/opt/openmpi CC=icc CXX=icc F77=ifort FC=ifort \
lt_prog_compiler_wl_FC='-Wl,';
make all install


Thanks again,
Cici

On Thu, Apr 21, 2022 at 11:18 PM George Bosilca via users <
users@lists.open-mpi.org> wrote:

> 1. I am not aware of any outstanding OMPI issues with the M1 chip that
> would prevent OMPI from compiling and running efficiently in an M1-based
> setup, assuming the compilation chain is working properly.
>
> 2. M1 supports x86 code via Rosetta, an app provided by Apple to ensure a
> smooth transition from the Intel-based to the M1-based laptop line. I do
> recall running an OMPI compiled on my Intel laptop on my M1 laptop to test
> the performance of the Rosetta binary translator. We even had some
> discussions about this, on the mailing list (or github issues).
>
> 3. Based on your original message, and their webpage, MARE2DEM does not
> support any compilation chain other than Intel's. As explained above, that
> might not be a showstopper by itself, because you can run x86 code on the
> M1 chip using Rosetta. However, MARE2DEM relies on MKL, the Intel Math
> Kernel Library, and that library will not run on an M1 chip.
>
>   George.
>
>
> On Thu, Apr 21, 2022 at 7:02 AM Jeff Squyres (jsquyres) via users <
> users@lists.open-mpi.org> wrote:
>
>> A little more color on Gilles' answer: I believe that we had some Open
>> MPI community members work on adding M1 support to Open MPI, but Gilles is
>> absolutely correct: the underlying compiler has to support the M1, or you
>> won't get anywhere.
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>>
>> 
>> From: users  on behalf of Cici Feng
>> via users 
>> Sent: Thursday, April 21, 2022 6:11 AM
>> To: Open MPI Users
>> Cc: Cici Feng
>> Subject: Re: [OMPI users] help with M1 chip macOS openMPI installation
>>
>> Gilles,
>>
>> Thank you so much for the quick response!
>> The openMPI installed by brew is compiled with gcc and gfortran using the
>> original compilers by Apple. Now I