Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-26 Thread Angel de Vicente via users
Hello,

thanks for your help and suggestions.

In the end it was not an issue with OpenMPI or any other system component,
but a single line in our code. I thought I was running the tests with the
-fbounds-check option, but it turns out I was not, argh! At some point I was
writing outside one of our arrays, and you can imagine the rest... The fact
that it happened only when running with '--bind-to none' and only on my
workstation sent me down all the wrong debugging paths. Once I realized that
-fbounds-check was not being used, figuring out the issue was a matter of
seconds.
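For anyone hitting something similar: here is a minimal, hypothetical
illustration (not our actual code) of the kind of bug that gfortran's
-fbounds-check (a.k.a. -fcheck=bounds) turns from silent memory corruption
into an immediate runtime error:

  program oob_demo
     implicit none
     integer :: a(10), i

     do i = 1, 11      ! off by one: the last iteration writes past a(10)
        a(i) = i
     end do

     print *, a(10)
  end program oob_demo

Compiled with -fbounds-check, the run aborts at the offending assignment and
reports the out-of-range index instead of quietly corrupting neighbouring
memory.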

Our code now happily runs the 3000+ tests without a hitch.

Cheers,
-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/


Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-23 Thread Protze, Joachim via users
Instead of using if (tid == 0) you should use the Fortran equivalent of:
#pragma omp masked
or, for pre-OpenMP-5.1 code,
#pragma omp master

This way you don't need to rely on preprocessor magic with the _OPENMP macro,
and at the same time you express your intended OpenMP semantics more clearly.
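In Fortran that is a MASTER construct (or MASKED with an OpenMP 5.1 compiler)
around the communication calls. A minimal sketch, reusing the variable names
from the snippet quoted below:

  !$omp master
  call mpi_send(my_rank, 1, mpi_integer, master, ask_job, &
                mpi_comm_world, mpierror)
  call mpi_probe(master, mpi_any_tag, mpi_comm_world, stat, mpierror)

  if (stat(mpi_tag) == stop_signal) then
     call mpi_recv(b_, 1, mpi_integer, master, stop_signal, &
                   mpi_comm_world, stat, mpierror)
  else
     call mpi_recv(iyax, 1, mpi_integer, master, give_job, &
                   mpi_comm_world, stat, mpierror)
  end if
  !$omp end master

  !$omp barrier   ! MASTER has no implied barrier, so keep the explicit one

Without -fopenmp the directives are plain comments, so the block still runs
on the single thread and no _OPENMP guard is needed.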

- Joachim

From: users  on behalf of Angel de Vicente 
via users 
Sent: Friday, April 22, 2022 10:31:38 PM
To: Keller, Rainer 
Cc: Angel de Vicente ; Open MPI Users 

Subject: Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation 
fault only when run with --bind-to none

Hello,

"Keller, Rainer"  writes:

> You’re using MPI_Probe() with Threads; that’s not safe.
> Please consider using MPI_Mprobe() together with MPI_Mrecv().

many thanks for the suggestion. I will try the M variants, though I
was under the impression that mpi_probe() was OK as long as one made sure
that the source and tag matched between the mpi_probe() and the
mpi_recv() calls.

As you can see below, I'm careful with that (in any case I'm not sure
the problem lies there, since the error I get is about an invalid memory
reference in the mpi_probe call itself).

,
|tid = 0
| #ifdef _OPENMP
|tid = omp_get_thread_num()
| #endif
|
|do
|   if (tid == 0) then
|  call mpi_send(my_rank, 1, mpi_integer, master, ask_job, &
|   mpi_comm_world, mpierror)
|  call mpi_probe(master,mpi_any_tag,mpi_comm_world,stat,mpierror)
|
|  if (stat(mpi_tag) == stop_signal) then
| call mpi_recv(b_,1,mpi_integer,master,stop_signal, &
|  mpi_comm_world,stat,mpierror)
|  else
| call mpi_recv(iyax,1,mpi_integer,master,give_job, &
|  mpi_comm_world,stat,mpierror)
|  end if
|   end if
|
|   !$omp barrier
|
|   [... actual work...]
`


> So getting into valgrind may be of help, possibly recompiling Open MPI
> enabling valgrind-checking together with debugging options.

I was hoping to avoid this route, but it certainly is looking like I'll
have to bite the bullet...

Thanks,
--
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/


Re: [OMPI users] help with M1 chip macOS openMPI installation

2022-04-22 Thread George Bosilca via users
I think you should work under the assumption of cross-compiling, because the
target architecture for the OMPI build should be x86 and not the local
architecture. It's been a while since I last cross-compiled, but I hear Gilles
does cross-compilation routinely, so he might be able to help.

  George.


On Fri, Apr 22, 2022 at 13:14 Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> Can you send all the information listed under "For compile problems"
> (please compress!):
>
> https://www.open-mpi.org/community/help/
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> 
> From: users  on behalf of Cici Feng via
> users 
> Sent: Friday, April 22, 2022 5:30 AM
> To: Open MPI Users
> Cc: Cici Feng
> Subject: Re: [OMPI users] help with M1 chip macOS openMPI installation
>
> Hi George,
>
> Thanks so much for the tips. I have installed Rosetta so that my
> computer can run the Intel software. However, the same error appears when I
> try to make OMPI, and here's how it looks:
>
> ../../../../opal/threads/thread_usage.h(163): warning #266: function
> "opal_atomic_swap_ptr" declared implicitly
>
>   OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)
>
>   ^
>
>
> In file included from ../../../../opal/class/opal_object.h(126),
>
>  from ../../../../opal/dss/dss_types.h(40),
>
>  from ../../../../opal/dss/dss.h(32),
>
>  from pmix3x_server_north.c(27):
>
> ../../../../opal/threads/thread_usage.h(163): warning #120: return value
> type does not match the function type
>
>   OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)
>
>   ^
>
>
> pmix3x_server_north.c(157): warning #266: function "opal_atomic_rmb"
> declared implicitly
>
>   OPAL_ACQUIRE_OBJECT(opalcaddy);
>
>   ^
>
>
>   CCLD mca_pmix_pmix3x.la
>
> Making all in mca/pstat/test
>
>   CCLD mca_pstat_test.la
>
> Making all in mca/rcache/grdma
>
>   CCLD mca_rcache_grdma.la
>
> Making all in mca/reachable/weighted
>
>   CCLD mca_reachable_weighted.la
>
> Making all in mca/shmem/mmap
>
>   CCLD mca_shmem_mmap.la
>
> Making all in mca/shmem/posix
>
>   CCLD mca_shmem_posix.la
>
> Making all in mca/shmem/sysv
>
>   CCLD mca_shmem_sysv.la
>
> Making all in tools/wrappers
>
>   CCLD opal_wrapper
>
> Undefined symbols for architecture x86_64:
>
>   "_opal_atomic_add_fetch_32", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_compare_exchange_strong_32", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_compare_exchange_strong_ptr", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_lock", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_lock_init", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_mb", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_rmb", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_sub_fetch_32", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_swap_32", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_swap_ptr", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_unlock", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_wmb", referenced from:
>
>   import-atom in libopen-pal.dylib
>
> ld: symbol(s) not found for architecture x86_64
>
> make[2]: *** [opal_wrapper] Error 1
>
> make[1]: *** [all-recursive] Error 1
>
> make: *** [all-recursive] Error 1
>
>
> I am not sure whether the ld part affects the make process or not. Either
> way, Error 1 appears for "opal_wrapper", which I think has been the error
> I kept encountering.
>
> Is there any explanation to this specific error?
>
> PS: the configure command I used is as follows, provided by the official
> website of MARE2DEM:
>
> sudo  ./configure --prefix=/opt/openmpi CC=icc CXX=icc F77=ifort FC=ifort \
> lt_prog_compiler_wl_FC='-Wl,';
> make all install

Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-22 Thread Angel de Vicente via users
Hello,

"Keller, Rainer"  writes:

> You’re using MPI_Probe() with Threads; that’s not safe.
> Please consider using MPI_Mprobe() together with MPI_Mrecv().

many thanks for the suggestion. I will try the M variants, though I
was under the impression that mpi_probe() was OK as long as one made sure
that the source and tag matched between the mpi_probe() and the
mpi_recv() calls.

As you can see below, I'm careful with that (in any case I'm not sure
the problem lies there, since the error I get is about an invalid memory
reference in the mpi_probe call itself).

,
|tid = 0  
| #ifdef _OPENMP  
|tid = omp_get_thread_num()   
| #endif  
| 
|do   
|   if (tid == 0) then
|  call mpi_send(my_rank, 1, mpi_integer, master, ask_job, &  
|   mpi_comm_world, mpierror) 
|  call mpi_probe(master,mpi_any_tag,mpi_comm_world,stat,mpierror)
| 
|  if (stat(mpi_tag) == stop_signal) then 
| call mpi_recv(b_,1,mpi_integer,master,stop_signal, &
|  mpi_comm_world,stat,mpierror)  
|  else   
| call mpi_recv(iyax,1,mpi_integer,master,give_job, & 
|  mpi_comm_world,stat,mpierror)  
|  end if 
|   end if
| 
|   !$omp barrier
| 
|   [... actual work...]
`


> So getting into valgrind may be of help, possibly recompiling Open MPI
> enabling valgrind-checking together with debugging options.

I was hoping to avoid this route, but it certainly is looking like I'll
have to bite the bullet...

Thanks,
-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/


Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-22 Thread Angel de Vicente via users
Hello Jeff,

"Jeff Squyres (jsquyres)"  writes:

> With THREAD_FUNNELED, it means that there can only be one thread in
> MPI at a time -- and it needs to be the same thread as the one that
> called MPI_INIT_THREAD.
>
> Is that the case in your app?


the master rank (i.e. 0) never creates threads, while the other ranks go
through the following code to communicate with it, so I do check that only
the master thread communicates:

,
|tid = 0  
| #ifdef _OPENMP  
|tid = omp_get_thread_num()   
| #endif  
| 
|do   
|   if (tid == 0) then
|  call mpi_send(my_rank, 1, mpi_integer, master, ask_job, &  
|   mpi_comm_world, mpierror) 
|  call mpi_probe(master,mpi_any_tag,mpi_comm_world,stat,mpierror)
| 
|  if (stat(mpi_tag) == stop_signal) then 
| call mpi_recv(b_,1,mpi_integer,master,stop_signal, &
|  mpi_comm_world,stat,mpierror)  
|  else   
| call mpi_recv(iyax,1,mpi_integer,master,give_job, & 
|  mpi_comm_world,stat,mpierror)  
|  end if 
|   end if
| 
|   !$omp barrier
| 
|   [... actual work...]
`


> Also, what is your app doing at src/pcorona_main.f90:627?

It is the mpi_probe call above.


In case it can clarify things, my app follows a master-worker paradigm,
where rank 0 hands over jobs, and all mpi ranks > 0 just do the following:

,
| !$OMP PARALLEL DEFAULT(NONE)
| do
|   !  (the code above) 
|   if (tid == 0) then receive job number | stop signal
|  
|   !$OMP DO schedule(dynamic)
|   loop_izax: do izax=sol_nz_min,sol_nz_max
| 
|  [big computing loop body]
| 
|   end do loop_izax  
|   !$OMP END DO  
| 
|   if (tid == 0) then 
|   call mpi_send(iyax,1,mpi_integer,master,results_tag, & 
|mpi_comm_world,mpierror)  
|   call mpi_send(stokes_buf_y,nz*8,mpi_double_precision, &
|master,results_tag,mpi_comm_world,mpierror)   
|   end if 
|  
|   !omp barrier   
|  
| end do   
| !$OMP END PARALLEL  
`



Following Gilles' suggestion, I also tried changing MPI_THREAD_FUNNELED
to MPI_THREAD_MULTIPLE just in case, but I get the same segmentation
fault at the same line (mind you, the segmentation fault doesn't happen
every time). But again, no issues when running with --bind-to socket
(and no apparent issues at all on the other computer, even with --bind-to
none).

Many thanks for any suggestions,
-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/


Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-22 Thread Keller, Rainer via users
Dear Angel,
You're using MPI_Probe() with threads; that's not safe.
Please consider using MPI_Mprobe() together with MPI_Mrecv().
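In Fortran, the probe/receive pair from the loop shown earlier in this thread
would then look roughly like this (a sketch only, reusing the variable names
from that snippet plus a new integer message handle):

  integer :: msg   ! message handle returned by mpi_mprobe

  call mpi_mprobe(master, mpi_any_tag, mpi_comm_world, msg, stat, mpierror)

  ! The matched message can only be received through its handle, so no other
  ! thread can consume it between the probe and the receive.
  if (stat(mpi_tag) == stop_signal) then
     call mpi_mrecv(b_, 1, mpi_integer, msg, stat, mpierror)
  else
     call mpi_mrecv(iyax, 1, mpi_integer, msg, stat, mpierror)
  end if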

However, you mention running with only one thread (setting OMP_NUM_THREADS=1),
assuming you didn't override it with omp_set_num_threads() or a num_threads()
clause…

So getting into valgrind may be of help, possibly after recompiling Open MPI
with valgrind checking enabled together with debugging options.

Best regards,
Rainer


> On 22. Apr 2022, at 14:40, Angel de Vicente via users 
>  wrote:
> 
> Hello,
> 
> I'm running out of ideas, and wonder if someone here could have some
> tips on how to debug a segmentation fault I'm having with my
> application [due to the nature of the problem I'm wondering if the
> problem is with OpenMPI itself rather than my app, though at this point
> I'm not leaning strongly either way].
> 
> The code is hybrid MPI+OpenMP and I compile it with gcc 10.3.0 and
> OpenMPI 4.1.3.
> 
> Usually I was running the code with "mpirun -np X --bind-to none [...]"
> so that the threads created by OpenMP don't get bound to a single core
> and I actually get proper speedup out of OpenMP.
> 
> Now, since I introduced some changes to the code this week (though I
> have read the changes carefully a number of times and I don't see
> anything suspicious), I sometimes get a segmentation fault, but only
> when I run with "--bind-to none" and only on my workstation. It is not
> always with the same run configuration, but I can see a pattern: the
> problem shows up only if I run the version compiled with OpenMP
> support, and most of the time only when the number of ranks*threads goes
> above 4 or so. If I run with "--bind-to socket" all looks good all
> the time.
>
> If I run it on another server, "--bind-to none" doesn't seem to be an
> issue (I submitted the jobs many, many times and got not a single
> segmentation fault), but on my workstation it fails almost every time
> when using MPI+OpenMP with a handful of threads and "--bind-to none". On
> that other server I'm running gcc 9.3.0 and OpenMPI 4.1.3.
> 
> For example, setting OMP_NUM_THREADS to 1, I run the code like the
> following, and get the segmentation fault as below:
> 
> ,
> | angelv@sieladon:~/.../Fe13_NL3/t~gauss+isat+istim$ mpirun -np 4 --bind-to 
> none  ../../../../../pcorona+openmp~gauss Fe13_NL3.params 
> |  Reading control file: Fe13_NL3.params
> |   ... Control file parameters broadcasted
> | 
> | [...]
> |  
> |  Starting calculation loop on the line of sight
> |  Receiving results from:2
> |  Receiving results from:1
> | 
> | Program received signal SIGSEGV: Segmentation fault - invalid memory 
> reference.
> | 
> | Backtrace for this error:
> |  Receiving results from:3
> | #0  0x7fd747e7555f in ???
> | #1  0x7fd7488778e1 in ???
> | #2  0x7fd7488667a4 in ???
> | #3  0x7fd7486fe84c in ???
> | #4  0x7fd7489aa9ce in ???
> | #5  0x414959 in __pcorona_main_MOD_main_loop._omp_fn.0
> | at src/pcorona_main.f90:627
> | #6  0x7fd74813ec75 in ???
> | #7  0x412bb0 in pcorona
> | at src/pcorona.f90:49
> | #8  0x40361c in main
> | at src/pcorona.f90:17
> | 
> | [...]
> | 
> | --
> | mpirun noticed that process rank 3 with PID 0 on node sieladon exited on 
> signal 11 (Segmentation fault).
> | ---
> `
> 
> I cannot see inside the MPI library (I don't really know if that would
> be helpful) but line 627 in pcorona_main.f90 is:
> 
> ,
> |  call mpi_probe(master,mpi_any_tag,mpi_comm_world,stat,mpierror)
> `
> 
> Any ideas/suggestions on what could be going on, or how to try and get some
> more clues about the possible causes of this?
> 
> Many thanks,
> -- 
> Ángel de Vicente
> 
> Tel.: +34 922 605 747
> Web.: http://research.iac.es/proyecto/polmag/

-
Prof. Dr.-Ing. Rainer Keller, HS Esslingen
Degree program coordinator, Master Angewandte Informatik (Applied Computer Science)
Professor of operating systems, distributed and parallel systems
Faculty of Computer Science and 

Re: [OMPI users] help with M1 chip macOS openMPI installation

2022-04-22 Thread Jeff Squyres (jsquyres) via users
Can you send all the information listed under "For compile problems" (please 
compress!):

https://www.open-mpi.org/community/help/

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Cici Feng via users 

Sent: Friday, April 22, 2022 5:30 AM
To: Open MPI Users
Cc: Cici Feng
Subject: Re: [OMPI users] help with M1 chip macOS openMPI installation

Hi George,

Thanks so much for the tips. I have installed Rosetta so that my computer can
run the Intel software. However, the same error appears when I try to make
OMPI, and here's how it looks:

../../../../opal/threads/thread_usage.h(163): warning #266: function 
"opal_atomic_swap_ptr" declared implicitly

  OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)

  ^


In file included from ../../../../opal/class/opal_object.h(126),

 from ../../../../opal/dss/dss_types.h(40),

 from ../../../../opal/dss/dss.h(32),

 from pmix3x_server_north.c(27):

../../../../opal/threads/thread_usage.h(163): warning #120: return value type 
does not match the function type

  OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)

  ^


pmix3x_server_north.c(157): warning #266: function "opal_atomic_rmb" declared 
implicitly

  OPAL_ACQUIRE_OBJECT(opalcaddy);

  ^


  CCLD mca_pmix_pmix3x.la

Making all in mca/pstat/test

  CCLD mca_pstat_test.la

Making all in mca/rcache/grdma

  CCLD mca_rcache_grdma.la

Making all in mca/reachable/weighted

  CCLD mca_reachable_weighted.la

Making all in mca/shmem/mmap

  CCLD mca_shmem_mmap.la

Making all in mca/shmem/posix

  CCLD mca_shmem_posix.la

Making all in mca/shmem/sysv

  CCLD mca_shmem_sysv.la

Making all in tools/wrappers

  CCLD opal_wrapper

Undefined symbols for architecture x86_64:

  "_opal_atomic_add_fetch_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_compare_exchange_strong_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_compare_exchange_strong_ptr", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_lock", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_lock_init", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_mb", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_rmb", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_sub_fetch_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_swap_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_swap_ptr", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_unlock", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_wmb", referenced from:

  import-atom in libopen-pal.dylib

ld: symbol(s) not found for architecture x86_64

make[2]: *** [opal_wrapper] Error 1

make[1]: *** [all-recursive] Error 1

make: *** [all-recursive] Error 1


I am not sure whether the ld part affects the make process or not. Either way,
Error 1 appears for "opal_wrapper", which I think has been the error I kept
encountering.

Is there any explanation to this specific error?

PS: the configure command I used is as follows, provided by the official
website of MARE2DEM:

sudo  ./configure --prefix=/opt/openmpi CC=icc CXX=icc F77=ifort FC=ifort \
lt_prog_compiler_wl_FC='-Wl,';
make all install

Thanks again,
Cici

On Thu, Apr 21, 2022 at 11:18 PM George Bosilca via users
<users@lists.open-mpi.org> wrote:
1. I am not aware of any outstanding OMPI issues with the M1 chip that would 
prevent OMPI from compiling and running efficiently in an M1-based setup, 
assuming the compilation chain is working properly.

2. M1 supports x86 code via Rosetta, an app provided by Apple to ensure a
smooth transition from the Intel-based to the M1-based laptop line. I do
recall running an OMPI compiled on my Intel laptop on my M1 laptop to test the
performance of the Rosetta binary translator. We even had some discussions
about this on the mailing list (or in github issues).

3. Based on your original message, and their webpage, MARE2DEM does not
support any compilation chain other than Intel. As explained above, that might
not by itself be a showstopper, because you can run x86 code on the M1 chip
using Rosetta. However, MARE2DEM relies on MKL, the Intel Math Kernel Library,
and that library will not run on an M1 chip.

  George.


On Thu, Apr 21, 2022 at 7:02 AM

Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-22 Thread Jeff Squyres (jsquyres) via users
With THREAD_FUNNELED, it means that there can only be one thread in MPI at a 
time -- and it needs to be the same thread as the one that called 
MPI_INIT_THREAD.

Is that the case in your app?

Also, what is your app doing at src/pcorona_main.f90:627?  It is making a call 
to MPI, or something else?  It might be useful to compile Open MPI (and/or 
other libraries that you're using) with -g so that you can get more meaningful 
stack traces upon error -- that might give some insight into where / why the 
failure is occurring.

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Angel de Vicente 
via users 
Sent: Friday, April 22, 2022 10:54 AM
To: Gilles Gouaillardet via users
Cc: Angel de Vicente
Subject: Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation 
fault only when run with --bind-to none

Thanks Gilles,

Gilles Gouaillardet via users  writes:

> You can first double check you
> MPI_Init_thread(..., MPI_THREAD_MULTIPLE, ...)

my code uses "mpi_thread_funneled" and OpenMPI was compiled with
MPI_THREAD_MULTIPLE support:

,
| ompi_info | grep  -i thread
|   Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, 
OMPI progress: no, ORTE progress: yes, Event lib: yes)
|FT Checkpoint support: no (checkpoint thread: no)
`

Cheers,
--
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/


Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-22 Thread Angel de Vicente via users
Thanks Gilles,

Gilles Gouaillardet via users  writes:

> You can first double check you
> MPI_Init_thread(..., MPI_THREAD_MULTIPLE, ...)

my code uses "mpi_thread_funneled" and OpenMPI was compiled with
MPI_THREAD_MULTIPLE support:

,
| ompi_info | grep  -i thread
|   Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, 
OMPI progress: no, ORTE progress: yes, Event lib: yes)
|FT Checkpoint support: no (checkpoint thread: no)
`

Cheers,
-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/


Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-22 Thread Gilles Gouaillardet via users
You can first double check your
MPI_Init_thread(..., MPI_THREAD_MULTIPLE, ...) call,
and that the provided level is MPI_THREAD_MULTIPLE, as you requested.
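A minimal sketch of that check in Fortran (a hypothetical standalone program,
not the code under discussion):

  program thread_check
     use mpi
     implicit none
     integer :: provided, ierr

     call mpi_init_thread(MPI_THREAD_MULTIPLE, provided, ierr)
     ! The thread-level constants are ordered, so a simple comparison works.
     if (provided < MPI_THREAD_MULTIPLE) then
        print *, 'requested MPI_THREAD_MULTIPLE but only got level', provided
     end if
     call mpi_finalize(ierr)
  end program thread_check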

Cheers,

Gilles

On Fri, Apr 22, 2022, 21:45 Angel de Vicente via users <
users@lists.open-mpi.org> wrote:

> Hello,
>
> I'm running out of ideas, and wonder if someone here could have some
> tips on how to debug a segmentation fault I'm having with my
> application [due to the nature of the problem I'm wondering if the
> problem is with OpenMPI itself rather than my app, though at this point
> I'm not leaning strongly either way].
>
> The code is hybrid MPI+OpenMP and I compile it with gcc 10.3.0 and
> OpenMPI 4.1.3.
>
> Usually I was running the code with "mpirun -np X --bind-to none [...]"
> so that the threads created by OpenMP don't get bound to a single core
> and I actually get proper speedup out of OpenMP.
>
> Now, since I introduced some changes to the code this week (though I
> have read the changes carefully a number of times and I don't see
> anything suspicious), I sometimes get a segmentation fault, but only
> when I run with "--bind-to none" and only on my workstation. It is not
> always with the same run configuration, but I can see a pattern: the
> problem shows up only if I run the version compiled with OpenMP
> support, and most of the time only when the number of ranks*threads goes
> above 4 or so. If I run with "--bind-to socket" all looks good all
> the time.
>
> If I run it on another server, "--bind-to none" doesn't seem to be an
> issue (I submitted the jobs many, many times and got not a single
> segmentation fault), but on my workstation it fails almost every time
> when using MPI+OpenMP with a handful of threads and "--bind-to none". On
> that other server I'm running gcc 9.3.0 and OpenMPI 4.1.3.
>
> For example, setting OMP_NUM_THREADS to 1, I run the code like the
> following, and get the segmentation fault as below:
>
> ,
> | angelv@sieladon:~/.../Fe13_NL3/t~gauss+isat+istim$ mpirun -np 4
> --bind-to none  ../../../../../pcorona+openmp~gauss Fe13_NL3.params
> |  Reading control file: Fe13_NL3.params
> |   ... Control file parameters broadcasted
> |
> | [...]
> |
> |  Starting calculation loop on the line of sight
> |  Receiving results from:2
> |  Receiving results from:1
> |
> | Program received signal SIGSEGV: Segmentation fault - invalid memory
> reference.
> |
> | Backtrace for this error:
> |  Receiving results from:3
> | #0  0x7fd747e7555f in ???
> | #1  0x7fd7488778e1 in ???
> | #2  0x7fd7488667a4 in ???
> | #3  0x7fd7486fe84c in ???
> | #4  0x7fd7489aa9ce in ???
> | #5  0x414959 in __pcorona_main_MOD_main_loop._omp_fn.0
> | at src/pcorona_main.f90:627
> | #6  0x7fd74813ec75 in ???
> | #7  0x412bb0 in pcorona
> | at src/pcorona.f90:49
> | #8  0x40361c in main
> | at src/pcorona.f90:17
> |
> | [...]
> |
> |
> --
> | mpirun noticed that process rank 3 with PID 0 on node sieladon exited on
> signal 11 (Segmentation fault).
> | ---
> `
>
> I cannot see inside the MPI library (I don't really know if that would
> be helpful) but line 627 in pcorona_main.f90 is:
>
> ,
> |  call
> mpi_probe(master,mpi_any_tag,mpi_comm_world,stat,mpierror)
> `
>
> Any ideas/suggestions on what could be going on, or how to try and get some
> more clues about the possible causes of this?
>
> Many thanks,
> --
> Ángel de Vicente
>
> Tel.: +34 922 605 747
> Web.: http://research.iac.es/proyecto/polmag/
>


Re: [OMPI users] help with M1 chip macOS openMPI installation

2022-04-22 Thread Cici Feng via users
Hi George,

Thanks so much for the tips. I have installed Rosetta so that my
computer can run the Intel software. However, the same error appears when I
try to make OMPI, and here's how it looks:

../../../../opal/threads/thread_usage.h(163): warning #266: function
"opal_atomic_swap_ptr" declared implicitly

  OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)

  ^


In file included from ../../../../opal/class/opal_object.h(126),

 from ../../../../opal/dss/dss_types.h(40),

 from ../../../../opal/dss/dss.h(32),

 from pmix3x_server_north.c(27):

../../../../opal/threads/thread_usage.h(163): warning #120: return value
type does not match the function type

  OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)

  ^


pmix3x_server_north.c(157): warning #266: function "opal_atomic_rmb"
declared implicitly

  OPAL_ACQUIRE_OBJECT(opalcaddy);

  ^


  CCLD mca_pmix_pmix3x.la

Making all in mca/pstat/test

  CCLD mca_pstat_test.la

Making all in mca/rcache/grdma

  CCLD mca_rcache_grdma.la

Making all in mca/reachable/weighted

  CCLD mca_reachable_weighted.la

Making all in mca/shmem/mmap

  CCLD mca_shmem_mmap.la

Making all in mca/shmem/posix

  CCLD mca_shmem_posix.la

Making all in mca/shmem/sysv

  CCLD mca_shmem_sysv.la

Making all in tools/wrappers

  CCLD opal_wrapper

Undefined symbols for architecture x86_64:

  "_opal_atomic_add_fetch_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_compare_exchange_strong_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_compare_exchange_strong_ptr", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_lock", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_lock_init", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_mb", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_rmb", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_sub_fetch_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_swap_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_swap_ptr", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_unlock", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_wmb", referenced from:

  import-atom in libopen-pal.dylib

ld: symbol(s) not found for architecture x86_64

make[2]: *** [opal_wrapper] Error 1

make[1]: *** [all-recursive] Error 1

make: *** [all-recursive] Error 1


I am not sure whether the ld part affects the make process or not. Either way,
Error 1 appears for "opal_wrapper", which I think has been the error I
kept encountering.

Is there any explanation to this specific error?

PS: the configure command I used is as follows, provided by the official
website of MARE2DEM:

sudo ./configure --prefix=/opt/openmpi CC=icc CXX=icc F77=ifort FC=ifort \
lt_prog_compiler_wl_FC='-Wl,';
make all install


Thanks again,
Cici

On Thu, Apr 21, 2022 at 11:18 PM George Bosilca via users <
users@lists.open-mpi.org> wrote:

> 1. I am not aware of any outstanding OMPI issues with the M1 chip that
> would prevent OMPI from compiling and running efficiently in an M1-based
> setup, assuming the compilation chain is working properly.
>
> 2. M1 supports x86 code via Rosetta, an app provided by Apple to ensure a
> smooth transition from the Intel-based to the M1-based laptop line. I do
> recall running an OMPI compiled on my Intel laptop on my M1 laptop to test
> the performance of the Rosetta binary translator. We even had some
> discussions about this on the mailing list (or in github issues).
>
> 3. Based on your original message, and their webpage, MARE2DEM does not
> support any compilation chain other than Intel. As explained above, that
> might not by itself be a showstopper, because you can run x86 code on the
> M1 chip using Rosetta. However, MARE2DEM relies on MKL, the Intel Math
> Kernel Library, and that library will not run on an M1 chip.
>
>   George.
>
>
> On Thu, Apr 21, 2022 at 7:02 AM Jeff Squyres (jsquyres) via users <
> users@lists.open-mpi.org> wrote:
>
>> A little more color on Gilles' answer: I believe that we had some Open
>> MPI community members work on adding M1 support to Open MPI, but Gilles is
>> absolutely correct: the underlying compiler has to support the M1, or you
>> won't get anywhere.
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>>
>> 
>> From: users  on behalf of Cici Feng
>> via users 
>> Sent: Thursday, 

Re: [OMPI users] help with M1 chip macOS openMPI installation

2022-04-21 Thread George Bosilca via users
1. I am not aware of any outstanding OMPI issues with the M1 chip that
would prevent OMPI from compiling and running efficiently in an M1-based
setup, assuming the compilation chain is working properly.

2. M1 supports x86 code via Rosetta, an app provided by Apple to ensure a
smooth transition from the Intel-based to the M1-based laptop line. I do
recall running an OMPI compiled on my Intel laptop on my M1 laptop to test
the performance of the Rosetta binary translator. We even had some
discussions about this on the mailing list (or in github issues).

3. Based on your original message, and their webpage, MARE2DEM does not
support any compilation chain other than Intel. As explained above, that
might not by itself be a showstopper, because you can run x86 code on the
M1 chip using Rosetta. However, MARE2DEM relies on MKL, the Intel Math
Kernel Library, and that library will not run on an M1 chip.

  George.


On Thu, Apr 21, 2022 at 7:02 AM Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> A little more color on Gilles' answer: I believe that we had some Open MPI
> community members work on adding M1 support to Open MPI, but Gilles is
> absolutely correct: the underlying compiler has to support the M1, or you
> won't get anywhere.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> 
> From: users  on behalf of Cici Feng via
> users 
> Sent: Thursday, April 21, 2022 6:11 AM
> To: Open MPI Users
> Cc: Cici Feng
> Subject: Re: [OMPI users] help with M1 chip macOS openMPI installation
>
> Gilles,
>
> Thank you so much for the quick response!
> The openMPI installed by brew is compiled with gcc and gfortran, using the
> original compilers by Apple. I haven't yet figured out how to use this gcc
> openMPI for the inversion software :(
> Given your answer, I think I'll pause the M1 + intel compilers + openMPI
> route for now and switch to an intel cluster until someone figures out the
> M1 chip problem ~
>
> Thanks again for your help!
> Cici
>
> On Thu, Apr 21, 2022 at 5:59 PM Gilles Gouaillardet via users <
> users@lists.open-mpi.org> wrote:
> Cici,
>
> I do not think the Intel C compiler is able to generate native code for
> the M1 (aarch64).
> The best case scenario is it would generate code for x86_64 and then
> Rosetta would be used to translate it to aarch64 code,
> and this is a very downgraded solution.
>
> So if you really want to stick to the Intel compiler, I strongly encourage
> you to run on Intel/AMD processors.
> Otherwise, use a native compiler for aarch64, and in this case, brew is
> not a bad option.
>
>
> Cheers,
>
> Gilles
>
> On Thu, Apr 21, 2022 at 6:36 PM Cici Feng via users <
> users@lists.open-mpi.org> wrote:
> Hi there,
>
> I am trying to install electromagnetic inversion software (MARE2DEM), for
> which the Intel C compilers and Open MPI are considered prerequisites.
> However, since I am completely new to computer science and coding, and given
> some technical issues with the computer I am building all this on, I have
> run into some questions with the whole process.
>
> The computer I am working on is a MacBook Pro with an M1 Max chip. Despite
> my friends discouraging me from continuing to work on my M1 laptop, I still
> want to reach out to the developers since I feel you might have a solution.
>
> I downloaded the openMPI source code from the .org website and ran "sudo
> configure and make all install", but I was not able to install openMPI on my
> computer. The error mentioned something about the chip not being supported.
>
> I have also tried to install openMPI through Homebrew using the command
> "brew install openmpi" and it worked just fine. However, since Homebrew has
> automatically set up the configuration of openMPI (it uses gcc and
> gfortran), I was not able to use my Intel compilers to build openMPI, which
> causes further problems in the installation of my inversion software.
>
> In conclusion, I think the M1 chip is currently the biggest problem in the
> whole installation process, yet I think you might have some solution for
> the installation. I would assume that Apple is switching all of its chips
> to M1, which makes the shift inevitable.
>
> I would really like to hear from you about a solution for installing
> openMPI on an M1 MacBook, and I would like to thank you for taking the time
> to read my long email.
>
> Thank you very much.
> Sincerely,
>
> Cici
>
>
>
>
>
>


Re: [OMPI users] help with M1 chip macOS openMPI installation

2022-04-21 Thread Jeff Squyres (jsquyres) via users
A little more color on Gilles' answer: I believe that we had some Open MPI 
community members work on adding M1 support to Open MPI, but Gilles is 
absolutely correct: the underlying compiler has to support the M1, or you won't 
get anywhere.

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Cici Feng via users 

Sent: Thursday, April 21, 2022 6:11 AM
To: Open MPI Users
Cc: Cici Feng
Subject: Re: [OMPI users] help with M1 chip macOS openMPI installation

Gilles,

Thank you so much for the quick response!
The openMPI installed by brew is compiled with gcc and gfortran, using the
original compilers by Apple. I haven't yet figured out how to use this gcc
openMPI for the inversion software :(
Given your answer, I think I'll pause the M1 + intel compilers + openMPI
route for now and switch to an intel cluster until someone figures out the
M1 chip problem ~

Thanks again for your help!
Cici

On Thu, Apr 21, 2022 at 5:59 PM Gilles Gouaillardet via users
<users@lists.open-mpi.org> wrote:
Cici,

I do not think the Intel C compiler is able to generate native code for the M1 
(aarch64).
The best case scenario is it would generate code for x86_64 and then Rosetta 
would be used to translate it to aarch64 code,
and this is a very downgraded solution.

So if you really want to stick to the Intel compiler, I strongly encourage you 
to run on Intel/AMD processors.
Otherwise, use a native compiler for aarch64, and in this case, brew is not a 
bad option.


Cheers,

Gilles

On Thu, Apr 21, 2022 at 6:36 PM Cici Feng via users
<users@lists.open-mpi.org> wrote:
Hi there,

I am trying to install electromagnetic inversion software (MARE2DEM), for
which the Intel C compilers and Open MPI are considered prerequisites.
However, since I am completely new to computer science and coding, and given
some technical issues with the computer I am building all this on, I have run
into some questions with the whole process.

The computer I am working on is a MacBook Pro with an M1 Max chip. Despite my
friends discouraging me from continuing to work on my M1 laptop, I still want
to reach out to the developers since I feel you might have a solution.

I downloaded the openMPI source code from the .org website and ran "sudo
configure and make all install", but I was not able to install openMPI on my
computer. The error mentioned something about the chip not being supported.

I have also tried to install openMPI through Homebrew using the command "brew
install openmpi" and it worked just fine. However, since Homebrew has
automatically set up the configuration of openMPI (it uses gcc and gfortran),
I was not able to use my Intel compilers to build openMPI, which causes
further problems in the installation of my inversion software.

In conclusion, I think the M1 chip is currently the biggest problem in the
whole installation process, yet I think you might have some solution for the
installation. I would assume that Apple is switching all of its chips to M1,
which makes the shift inevitable.

I would really like to hear from you about a solution for installing openMPI
on an M1 MacBook, and I would like to thank you for taking the time to read my
long email.

Thank you very much.
Sincerely,

Cici







Re: [OMPI users] help with M1 chip macOS openMPI installation

2022-04-21 Thread Cici Feng via users
Gilles,

Thank you so much for the quick response!
The openMPI installed by brew is compiled with gcc and gfortran, using the
original compilers by Apple. I haven't yet figured out how to use this gcc
openMPI for the inversion software :(
Given your answer, I think I'll pause the M1 + intel compilers + openMPI
route for now and switch to an intel cluster until someone figures out the
M1 chip problem ~

Thanks again for your help!
Cici

On Thu, Apr 21, 2022 at 5:59 PM Gilles Gouaillardet via users <
users@lists.open-mpi.org> wrote:

> Cici,
>
> I do not think the Intel C compiler is able to generate native code for
> the M1 (aarch64).
> The best case scenario is it would generate code for x86_64 and then
> Rosetta would be used to translate it to aarch64 code,
> and this is a very downgraded solution.
>
> So if you really want to stick to the Intel compiler, I strongly encourage
> you to run on Intel/AMD processors.
> Otherwise, use a native compiler for aarch64, and in this case, brew is
> not a bad option.
>
>
> Cheers,
>
> Gilles
>
> On Thu, Apr 21, 2022 at 6:36 PM Cici Feng via users <
> users@lists.open-mpi.org> wrote:
>
>> Hi there,
>>
>> I am trying to install electromagnetic inversion software (MARE2DEM),
>> for which the Intel C compilers and Open MPI are considered prerequisites.
>> However, since I am completely new to computer science and coding, and
>> given some technical issues with the computer I am building all this on, I
>> have run into some questions with the whole process.
>>
>> The computer I am working on is a MacBook Pro with an M1 Max chip. Despite
>> my friends discouraging me from continuing to work on my M1 laptop, I still
>> want to reach out to the developers since I feel you might have a
>> solution.
>>
>> I downloaded the openMPI source code from the .org website and ran "sudo
>> configure and make all install", but I was not able to install openMPI on
>> my computer. The error mentioned something about the chip not being
>> supported.
>>
>> I have also tried to install openMPI through Homebrew using the command
>> "brew install openmpi" and it worked just fine. However, since Homebrew has
>> automatically set up the configuration of openMPI (it uses gcc and
>> gfortran), I was not able to use my Intel compilers to build openMPI, which
>> causes further problems in the installation of my inversion software.
>>
>> In conclusion, I think the M1 chip is currently the biggest problem in
>> the whole installation process, yet I think you might have some
>> solution for the installation. I would assume that Apple is switching all
>> of its chips to M1, which makes the shift inevitable.
>>
>> I would really like to hear from you about a solution for installing
>> openMPI on an M1 MacBook, and I would like to thank you for taking the time
>> to read my long email.
>>
>> Thank you very much.
>> Sincerely,
>>
>> Cici
>>
>>
>>
>>
>>
>>


Re: [OMPI users] help with M1 chip macOS openMPI installation

2022-04-21 Thread Gilles Gouaillardet via users
Cici,

I do not think the Intel C compiler is able to generate native code for the
M1 (aarch64).
The best case scenario is it would generate code for x86_64 and then
Rosetta would be used to translate it to aarch64 code,
and this is a very downgraded solution.

So if you really want to stick to the Intel compiler, I strongly encourage
you to run on Intel/AMD processors.
Otherwise, use a native compiler for aarch64, and in this case, brew is not
a bad option.


Cheers,

Gilles

On Thu, Apr 21, 2022 at 6:36 PM Cici Feng via users <
users@lists.open-mpi.org> wrote:

> Hi there,
>
> I am trying to install electromagnetic inversion software (MARE2DEM), for
> which the Intel C compilers and Open MPI are considered prerequisites.
> However, since I am completely new to computer science and coding, and given
> some technical issues with the computer I am building all this on, I have
> run into some questions with the whole process.
>
> The computer I am working on is a MacBook Pro with an M1 Max chip. Despite
> my friends discouraging me from continuing to work on my M1 laptop, I still
> want to reach out to the developers since I feel you might have a
> solution.
>
> I downloaded the openMPI source code from the .org website and ran "sudo
> configure and make all install", but I was not able to install openMPI on
> my computer. The error mentioned something about the chip not being
> supported.
>
> I have also tried to install openMPI through Homebrew using the command
> "brew install openmpi" and it worked just fine. However, since Homebrew has
> automatically set up the configuration of openMPI (it uses gcc and
> gfortran), I was not able to use my Intel compilers to build openMPI, which
> causes further problems in the installation of my inversion software.
>
> In conclusion, I think the M1 chip is currently the biggest problem in the
> whole installation process, yet I think you might have some solution
> for the installation. I would assume that Apple is switching all of its
> chips to M1, which makes the shift inevitable.
>
> I would really like to hear from you about a solution for installing
> openMPI on an M1 MacBook, and I would like to thank you for taking the time
> to read my long email.
>
> Thank you very much.
> Sincerely,
>
> Cici
>
>
>
>
>
>


Re: [OMPI users] [Help] Must orted exit after all spawned processes exit

2021-05-19 Thread Ralph Castain via users
To answer your specific questions:

The backend daemons (orted) will not exit until all locally spawned procs exit. 
This is not configurable - for one thing, OMPI procs will suicide if they see 
the daemon depart, so it makes no sense to have the daemon fail if a proc 
terminates. The logic behind this behavior spans multiple parts of the code 
base, I'm afraid.

On May 17, 2021, at 7:03 AM, Jeff Squyres (jsquyres) via users
<users@lists.open-mpi.org> wrote:

FYI: general Open MPI questions are better sent to the user's mailing list.

Up through the v4.1.x series, the "orted" is a general helper process that Open 
MPI uses on the back-end.  It will not quit until all of its children have 
died.  Open MPI's run time is designed with the intent that some external 
helper will be there for the entire duration of the job; there is no option to 
run without one.

Two caveats:

1. In Open MPI v5.0.x, from the user's perspective, "orted" has been renamed to 
be "prted".  Since this is 99.999% behind the scenes, most users won't notice 
the difference.

2. You can run without "orted" (or "prted") if you use a different run-time 
environment (e.g., SLURM).  In this case, you'll use that environment's 
launcher (e.g., srun or sbatch in SLURM environments) to directly launch MPI 
processes -- you won't use "mpirun" at all.  Fittingly, this is called "direct 
launch" in Open MPI parlance (i.e., using another run-time's daemons to launch 
processes instead of first launching orteds (or prteds)).



On May 16, 2021, at 8:34 AM, 叶安华 <yean...@sensetime.com> wrote:

Code snippet:

# sleep.sh
sleep 10001 &
/bin/sh son_sleep.sh
sleep 10002

# son_sleep.sh
sleep 10003 &
sleep 10004 &

thanks
Anhua
From: 叶安华 <yean...@sensetime.com>
Date: Sunday, May 16, 2021 at 20:31
To: "jsquy...@cisco.com  " mailto:jsquy...@cisco.com> >
Subject: [Help] Must orted exit after all spawned processes exit
 Dear Jeff, 
 Sorry to bother you but I am really curious about the conditions on which 
orted exits in the below scenario, and I am looking forward to hearing from you.
 Scenario description:
· Step 1: start a remote process via "mpirun -np 1 -host 10.211.55.4 sh 
sleep.sh"
· Step 2: check pstree in the remote host:

· Step 3: the mpirun process in step 1 does not exit until I kill all
the sleeping processes, which are 15479 15481 15482 15483

To conclude, my questions are as follows:
1.  Must orted wait until all spawned processes exit?
2.  Is this behavior configurable? What if I want orted to exit immediately
after any one of the spawned processes exits?
3.  I did not find the specific logic for orted waiting for spawned
processes to exit; I hope I can get some hints from you.
 PS (scripts):

  thanks
Anhua
 

-- 
Jeff Squyres
jsquy...@cisco.com  






Re: [OMPI users] [Help] Must orted exit after all spawned processes exit

2021-05-17 Thread Jeff Squyres (jsquyres) via users
FYI: general Open MPI questions are better sent to the user's mailing list.

Up through the v4.1.x series, the "orted" is a general helper process that Open 
MPI uses on the back-end.  It will not quit until all of its children have 
died.  Open MPI's run time is designed with the intent that some external 
helper will be there for the entire duration of the job; there is no option to 
run without one.

Two caveats:

1. In Open MPI v5.0.x, from the user's perspective, "orted" has been renamed to 
be "prted".  Since this is 99.999% behind the scenes, most users won't notice 
the difference.

2. You can run without "orted" (or "prted") if you use a different run-time 
environment (e.g., SLURM).  In this case, you'll use that environment's 
launcher (e.g., srun or sbatch in SLURM environments) to directly launch MPI 
processes -- you won't use "mpirun" at all.  Fittingly, this is called "direct 
launch" in Open MPI parlance (i.e., using another run-time's daemons to launch 
processes instead of first launching orteds (or prteds)).



On May 16, 2021, at 8:34 AM, 叶安华 <yean...@sensetime.com> wrote:

Code snippet:

# sleep.sh
sleep 10001 &
/bin/sh son_sleep.sh
sleep 10002

# son_sleep.sh
sleep 10003 &
sleep 10004 &

thanks
Anhua


From: 叶安华 <yean...@sensetime.com>
Date: Sunday, May 16, 2021 at 20:31
To: "jsquy...@cisco.com" 
mailto:jsquy...@cisco.com>>
Subject: [Help] Must orted exit after all spawned processes exit

Dear Jeff,

Sorry to bother you but I am really curious about the conditions on which orted 
exits in the below scenario, and I am looking forward to hearing from you.

Scenario description:
• Step 1: start a remote process via "mpirun -np 1 -host 10.211.55.4 sh 
sleep.sh"
• Step 2: check pstree in the remote host:

• Step 3: the mpirun process in step 1 does not exit until I kill all
the sleeping processes, which are 15479 15481 15482 15483

To conclude, my questions are as follows:

  1.  Must orted wait until all spawned processes exit?
  2.  Is this behavior configurable? What if I want orted to exit immediately
after any one of the spawned processes exits?
  3.  I did not find the specific logic for orted waiting for spawned
processes to exit; I hope I can get some hints from you.


PS (scripts):



thanks
Anhua



--
Jeff Squyres
jsquy...@cisco.com





Re: [OMPI users] help

2020-12-14 Thread Lesiano 16 via users
Thanks for the answer


On Mon, Dec 14, 2020 at 4:20 PM Jeff Squyres (jsquyres) wrote:

> On Dec 12, 2020, at 4:58 AM, Lesiano 16 via users <
> users@lists.open-mpi.org> wrote:
> >
> > My question is, can I assume that when skipping the beginning of the
> > file MPI will fill it up with zeros? Or is it implementation dependent?
> >
> > I have read the standard, but I could not find anything meaningful
> > except for:
> >
> > "Initially, all processes view the file as a linear byte stream, and
> each process views data in its own native representation (no data
> representation conversion is performed). (POSIX files are linear byte
> streams in the native representation.) The file view can be changed via the
> MPI_FILE_SET_VIEW routine."
> >
> > which I am not sure is actually relevant or not.
>
> In general, the contents of a file written by the MPI IO interface are
> going to be implementation-specific.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
>


Re: [OMPI users] help

2020-12-14 Thread Jeff Squyres (jsquyres) via users
On Dec 12, 2020, at 4:58 AM, Lesiano 16 via users  
wrote:
> 
> My question is, can I assume that when skipping the beginning of the file
> MPI will fill it up with zeros? Or is it implementation dependent?
> 
> I have read the standard, but I could not find anything meaningful except
> for:
> 
> "Initially, all processes view the file as a linear byte stream, and each 
> process views data in its own native representation (no data representation 
> conversion is performed). (POSIX files are linear byte streams in the native 
> representation.) The file view can be changed via the MPI_FILE_SET_VIEW 
> routine."
> 
> which I am not sure is actually relevant or not.

In general, the contents of a file written by the MPI IO interface are going to 
be implementation-specific.

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI

2020-02-25 Thread Adam Simpson via users
I would start by running with the docker flags I provided to see if that fixes 
the issue:

$ docker run --privileged --security-opt label=disable --security-opt 
seccomp=unconfined --security-opt apparmor=unconfined --ipc=host --network=host 
...

These flags strip away some of the security and confinement features that are 
enabled by default in Docker and are my go-to set of flags when diagnosing what 
I suspect are container related issues. If it fixes your issues you can look 
into using a finer granularity approach to stripping away security features, 
such as editing seccomp, if you'd like; It depends on the environment and how 
important the various security and confinement features are to you. If the 
flags don't fix your issue you might check for ptrace restrictions on the host 
using the sysctl commands in my previous message.
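
As a rough sketch of that finer-grained route (the profile file name below is 
illustrative; the syscall names and flags are the ones from the default-profile 
discussion earlier in this thread):

# Copy the default seccomp profile locally (e.g. as myseccomp.json), add
# "process_vm_readv" and "process_vm_writev" to its syscalls "names" list,
# then point Docker at that file:
$ docker run --security-opt seccomp=myseccomp.json ...

# Or keep the default profile and just grant the ptrace capability:
$ docker run --cap-add=SYS_PTRACE ...

# And the yama check on the host (relax it only if the value is non-zero):
$ sysctl kernel.yama.ptrace_scope
$ sudo sysctl -w kernel.yama.ptrace_scope=0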

Best,
Adam

From: Matt Thompson 
Sent: Tuesday, February 25, 2020 5:54 AM
To: Adam Simpson 
Cc: Open MPI Users 
Subject: Re: [OMPI users] Help with One-Sided Communication: Works in Intel 
MPI, Fails in Open MPI

External email: Use caution opening links or attachments

Adam,

A couple questions. First, is seccomp the reason you think I have the 
MPI_THREAD_MULTIPLE error? Or is it more for the vader error? If so, the 
environment variable Nathan provided is probably enough. These are unit tests 
and should execute in seconds at most (building them takes 10x-100x more time).

But if it can help with the MPI_THREAD_MULTIPLE error, can you help translate 
that to "Fortran programmer who really can only do docker build/run/push/cp" 
for me? I found this page: https://docs.docker.com/engine/security/seccomp/ 
that I'm trying to read through and understand, but I'm mainly learning I 
should be looking at taking some Docker training soon!

On Mon, Feb 24, 2020 at 8:24 PM Adam Simpson <asimp...@nvidia.com> wrote:
Calls to process_vm_readv() and process_vm_writev() are disabled in the default 
Docker seccomp 
profile<https://github.com/moby/moby/blob/master/profiles/seccomp/default.json>.
 You can add the docker flag --cap-add=SYS_PTRACE or better yet modify the 
seccomp profile so that process_vm_readv and process_vm_writev are whitelisted, 
by adding them to the syscalls.names list.

You can also disable seccomp, and several other confinement and security 
features, if you prefer a heavy handed approach:

$ docker run --privileged --security-opt label=disable --security-opt 
seccomp=unconfined --security-opt apparmor=unconfined --ipc=host --network=host 
...

If you're still having trouble after fixing the above you may need to check 
yama on the host. You can check with "sysctl -w kernel.yama.ptrace_scope", if 
it returns a value other than 0 you may need to disable it with "sysctl -w 
kernel.yama.ptrace_scope=0".

Adam


From: users <users-boun...@lists.open-mpi.org> on behalf of Matt Thompson via 
users <users@lists.open-mpi.org>
Sent: Monday, February 24, 2020 5:15 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Matt Thompson <fort...@gmail.com>
Subject: Re: [OMPI users] Help with One-Sided Communication: Works in Intel 
MPI, Fails in Open MPI

External email: Use caution opening links or attachments

Nathan,

The reproducer would be that code that's on the Intel website. That is what I 
was running. You could pull my image if you like but...since you are the genius:

[root@adac3ce0cf32 ~]# mpirun --mca btl_vader_single_copy_mechanism none -np 2 
./a.out
Rank 0 running on adac3ce0cf32
Rank 1 running on adac3ce0cf32
Rank 0 sets data in the shared memory: 00 01 02 03
Rank 1 sets data in the shared memory: 10 11 12 13
Rank 0 gets data from the shared memory: 10 11 12 13
Rank 0 has new data in the shared memory: 00 01 02 03
Rank 1 gets data from the shared memory: 00 01 02 03
Rank 1 has new data in the shared memory: 10 11 12 13

And knowing this led to: https://github.com/open-mpi/ompi/issues/4948

So, good news is that setting export 
OMPI_MCA_btl_vader_single_copy_mechanism=none lets a lot of stuff work. The 
bad news is we seem to be using MPI_THREAD_MULTIPLE and it does not like it:

Start 2: pFIO_tests_mpi

2: Test command: /opt/openmpi-4.0.2/bin/mpiexec "-n" "18" "-oversubscribe" 
"/root/project/MAPL/build/bin/pfio_ctest_io.x" "-nc" "6" "-nsi" "6" "-nso" "6" 
"-ngo" "1" "-ngi" "1" "-v" "T,U" "-s" "mpi"
2: Test timeout computed to be: 1500
2: --
2: The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this release.
2: Workarounds are to run on a single node, or to use a system with an RDMA
2: capable network such as Infiniband.
2: ---

Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI

2020-02-25 Thread Matt Thompson via users
Adam,

A couple questions. First, is seccomp the reason you think I have the
MPI_THREAD_MULTIPLE error? Or is it more for the vader error? If so, the
environment variable Nathan provided is probably enough. These are unit
tests and should execute in seconds at most (building them takes 10x-100x
more time).

But if it can help with the MPI_THREAD_MULTIPLE error, can you help
translate that to "Fortran programmer who really can only do docker
build/run/push/cp" for me? I found this page:
https://docs.docker.com/engine/security/seccomp/ that I'm trying to read
through and understand, but I'm mainly learning I should be looking at
taking some Docker training soon!

On Mon, Feb 24, 2020 at 8:24 PM Adam Simpson  wrote:

> Calls to process_vm_readv() and process_vm_writev() are disabled in the
> default Docker seccomp profile
> <https://github.com/moby/moby/blob/master/profiles/seccomp/default.json>.
> You can add the docker flag --cap-add=SYS_PTRACE or better yet modify the
> seccomp profile so that process_vm_readv and process_vm_writev are
> whitelisted, by adding them to the syscalls.names list.
>
> You can also disable seccomp, and several other confinement and security
> features, if you prefer a heavy handed approach:
>
> $ docker run --privileged --security-opt label=disable --security-opt
> seccomp=unconfined --security-opt apparmor=unconfined --ipc=host
> --network=host ...
>
> If you're still having trouble after fixing the above you may need to
> check yama on the host. You can check with "sysctl
> kernel.yama.ptrace_scope", if it returns a value other than 0 you may
> need to disable it with "sysctl -w kernel.yama.ptrace_scope=0".
>
> Adam
>
> --
> *From:* users  on behalf of Matt
> Thompson via users 
> *Sent:* Monday, February 24, 2020 5:15 PM
> *To:* Open MPI Users 
> *Cc:* Matt Thompson 
> *Subject:* Re: [OMPI users] Help with One-Sided Communication: Works in
> Intel MPI, Fails in Open MPI
>
> *External email: Use caution opening links or attachments*
> Nathan,
>
> The reproducer would be that code that's on the Intel website. That is
> what I was running. You could pull my image if you like but...since you are
> the genius:
>
> [root@adac3ce0cf32 ~]# mpirun --mca btl_vader_single_copy_mechanism none
> -np 2 ./a.out
>
> Rank 0 running on adac3ce0cf32
> Rank 1 running on adac3ce0cf32
> Rank 0 sets data in the shared memory: 00 01 02 03
> Rank 1 sets data in the shared memory: 10 11 12 13
> Rank 0 gets data from the shared memory: 10 11 12 13
> Rank 0 has new data in the shared memory: 00 01 02 03
> Rank 1 gets data from the shared memory: 00 01 02 03
> Rank 1 has new data in the shared memory: 10 11 12 13
>
> And knowing this led to: https://github.com/open-mpi/ompi/issues/4948
>
> So, good news is that setting export
> OMPI_MCA_btl_vader_single_copy_mechanism=none lets a lot of stuff work.
> The bad news is we seem to be using MPI_THREAD_MULTIPLE and it does not
> like it:
>
> Start 2: pFIO_tests_mpi
>
> 2: Test command: /opt/openmpi-4.0.2/bin/mpiexec "-n" "18" "-oversubscribe"
> "/root/project/MAPL/build/bin/pfio_ctest_io.x" "-nc" "6" "-nsi" "6" "-nso"
> "6" "-ngo" "1" "-ngi" "1" "-v" "T,U" "-s" "mpi"
> 2: Test timeout computed to be: 1500
> 2:
> --
> 2: The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this
> release.
> 2: Workarounds are to run on a single node, or to use a system with an RDMA
> 2: capable network such as Infiniband.
> 2:
> --
> 2: [adac3ce0cf32:03619] *** An error occurred in MPI_Win_create
> 2: [adac3ce0cf32:03619] *** reported by process [270073857,16]
> 2: [adac3ce0cf32:03619] *** on communicator MPI COMMUNICATOR 4 DUP FROM 3
> 2: [adac3ce0cf32:03619] *** MPI_ERR_WIN: invalid window
> 2: [adac3ce0cf32:03619] *** MPI_ERRORS_ARE_FATAL (processes in this
> communicator will now abort,
> 2: [adac3ce0cf32:03619] ***and potentially your MPI job)
> 2: [adac3ce0cf32:03587] 17 more processes have sent help message
> help-osc-pt2pt.txt / mpi-thread-multiple-not-supported
> 2: [adac3ce0cf32:03587] Set MCA parameter "orte_base_help_aggregate" to 0
> to see all help / error messages
> 2: [adac3ce0cf32:03587] 17 more processes have sent help message
> help-mpi-errors.txt / mpi_errors_are_fatal
> 2/5 Test #2: pFIO_tests_mpi ...***Failed0.18 sec
>
> 40% tests passed, 3 tests failed out of 

Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI

2020-02-24 Thread Adam Simpson via users
Calls to process_vm_readv() and process_vm_writev() are disabled in the default 
Docker seccomp 
profile<https://github.com/moby/moby/blob/master/profiles/seccomp/default.json>.
 You can add the docker flag --cap-add=SYS_PTRACE or better yet modify the 
seccomp profile so that process_vm_readv and process_vm_writev are whitelisted, 
by adding them to the syscalls.names list.

You can also disable seccomp, and several other confinement and security 
features, if you prefer a heavy handed approach:

$ docker run --privileged --security-opt label=disable --security-opt 
seccomp=unconfined --security-opt apparmor=unconfined --ipc=host --network=host 
...

If you're still having trouble after fixing the above you may need to check 
yama on the host. You can check with "sysctl kernel.yama.ptrace_scope", if 
it returns a value other than 0 you may need to disable it with "sysctl -w 
kernel.yama.ptrace_scope=0".

Adam


From: users  on behalf of Matt Thompson via 
users 
Sent: Monday, February 24, 2020 5:15 PM
To: Open MPI Users 
Cc: Matt Thompson 
Subject: Re: [OMPI users] Help with One-Sided Communication: Works in Intel 
MPI, Fails in Open MPI

External email: Use caution opening links or attachments

Nathan,

The reproducer would be that code that's on the Intel website. That is what I 
was running. You could pull my image if you like but...since you are the genius:

[root@adac3ce0cf32 ~]# mpirun --mca btl_vader_single_copy_mechanism none -np 2 
./a.out
Rank 0 running on adac3ce0cf32
Rank 1 running on adac3ce0cf32
Rank 0 sets data in the shared memory: 00 01 02 03
Rank 1 sets data in the shared memory: 10 11 12 13
Rank 0 gets data from the shared memory: 10 11 12 13
Rank 0 has new data in the shared memory: 00 01 02 03
Rank 1 gets data from the shared memory: 00 01 02 03
Rank 1 has new data in the shared memory: 10 11 12 13

And knowing this led to: https://github.com/open-mpi/ompi/issues/4948

So, good news is that setting export 
OMPI_MCA_btl_vader_single_copy_mechanism=none lets a lot of stuff work. The 
bad news is we seem to be using MPI_THREAD_MULTIPLE and it does not like it:

Start 2: pFIO_tests_mpi

2: Test command: /opt/openmpi-4.0.2/bin/mpiexec "-n" "18" "-oversubscribe" 
"/root/project/MAPL/build/bin/pfio_ctest_io.x" "-nc" "6" "-nsi" "6" "-nso" "6" 
"-ngo" "1" "-ngi" "1" "-v" "T,U" "-s" "mpi"
2: Test timeout computed to be: 1500
2: --
2: The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this release.
2: Workarounds are to run on a single node, or to use a system with an RDMA
2: capable network such as Infiniband.
2: --
2: [adac3ce0cf32:03619] *** An error occurred in MPI_Win_create
2: [adac3ce0cf32:03619] *** reported by process [270073857,16]
2: [adac3ce0cf32:03619] *** on communicator MPI COMMUNICATOR 4 DUP FROM 3
2: [adac3ce0cf32:03619] *** MPI_ERR_WIN: invalid window
2: [adac3ce0cf32:03619] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,
2: [adac3ce0cf32:03619] ***and potentially your MPI job)
2: [adac3ce0cf32:03587] 17 more processes have sent help message 
help-osc-pt2pt.txt / mpi-thread-multiple-not-supported
2: [adac3ce0cf32:03587] Set MCA parameter "orte_base_help_aggregate" to 0 to 
see all help / error messages
2: [adac3ce0cf32:03587] 17 more processes have sent help message 
help-mpi-errors.txt / mpi_errors_are_fatal
2/5 Test #2: pFIO_tests_mpi ...***Failed0.18 sec

40% tests passed, 3 tests failed out of 5

Total Test time (real) =   1.08 sec

The following tests FAILED:
  2 - pFIO_tests_mpi (Failed)
  3 - pFIO_tests_simple (Failed)
  4 - pFIO_tests_hybrid (Failed)
Errors while running CTest

The weird thing is, I *am* running on one node (it's all I have, I'm not fancy 
enough at AWS to try more yet) and ompi_info does mention MPI_THREAD_MULTIPLE:

[root@adac3ce0cf32 build]# ompi_info | grep -i mult
  Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, 
OMPI progress: no, ORTE progress: yes, Event lib: yes)

Any ideas on this one?

On Mon, Feb 24, 2020 at 7:24 PM Nathan Hjelm via users <users@lists.open-mpi.org> wrote:
The error is from btl/vader. CMA is not functioning as expected. It might work 
if you set btl_vader_single_copy_mechanism=none

Performance will suffer though. It would be worth understanding why 
process_vm_readv is failing.

Can you send a simple reproducer?

-Nathan

On Feb 24, 2020, at 2:59 PM, Gabriel, Edgar via users <users@lists.open-mpi.org> wrote:



I am not an expert for the one-sided code in Open MPI, I wanted to c

Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI

2020-02-24 Thread Matt Thompson via users
Nathan,

The reproducer would be that code that's on the Intel website. That is what
I was running. You could pull my image if you like but...since you are the
genius:

[root@adac3ce0cf32 ~]# mpirun --mca btl_vader_single_copy_mechanism none
-np 2 ./a.out

Rank 0 running on adac3ce0cf32
Rank 1 running on adac3ce0cf32
Rank 0 sets data in the shared memory: 00 01 02 03
Rank 1 sets data in the shared memory: 10 11 12 13
Rank 0 gets data from the shared memory: 10 11 12 13
Rank 0 has new data in the shared memory: 00 01 02 03
Rank 1 gets data from the shared memory: 00 01 02 03
Rank 1 has new data in the shared memory: 10 11 12 13

And knowing this led to: https://github.com/open-mpi/ompi/issues/4948

So, good news is that setting export
OMPI_MCA_btl_vader_single_copy_mechanism=none lets a lot of stuff work.
The bad news is we seem to be using MPI_THREAD_MULTIPLE and it does not
like it:

Start 2: pFIO_tests_mpi

2: Test command: /opt/openmpi-4.0.2/bin/mpiexec "-n" "18" "-oversubscribe"
"/root/project/MAPL/build/bin/pfio_ctest_io.x" "-nc" "6" "-nsi" "6" "-nso"
"6" "-ngo" "1" "-ngi" "1" "-v" "T,U" "-s" "mpi"
2: Test timeout computed to be: 1500
2:
--
2: The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this
release.
2: Workarounds are to run on a single node, or to use a system with an RDMA
2: capable network such as Infiniband.
2:
--
2: [adac3ce0cf32:03619] *** An error occurred in MPI_Win_create
2: [adac3ce0cf32:03619] *** reported by process [270073857,16]
2: [adac3ce0cf32:03619] *** on communicator MPI COMMUNICATOR 4 DUP FROM 3
2: [adac3ce0cf32:03619] *** MPI_ERR_WIN: invalid window
2: [adac3ce0cf32:03619] *** MPI_ERRORS_ARE_FATAL (processes in this
communicator will now abort,
2: [adac3ce0cf32:03619] ***and potentially your MPI job)
2: [adac3ce0cf32:03587] 17 more processes have sent help message
help-osc-pt2pt.txt / mpi-thread-multiple-not-supported
2: [adac3ce0cf32:03587] Set MCA parameter "orte_base_help_aggregate" to 0
to see all help / error messages
2: [adac3ce0cf32:03587] 17 more processes have sent help message
help-mpi-errors.txt / mpi_errors_are_fatal
2/5 Test #2: pFIO_tests_mpi ...***Failed0.18 sec

40% tests passed, 3 tests failed out of 5

Total Test time (real) =   1.08 sec

The following tests FAILED:
  2 - pFIO_tests_mpi (Failed)
  3 - pFIO_tests_simple (Failed)
  4 - pFIO_tests_hybrid (Failed)
Errors while running CTest

The weird thing is, I *am* running on one node (it's all I have, I'm not
fancy enough at AWS to try more yet) and ompi_info does mention
MPI_THREAD_MULTIPLE:

[root@adac3ce0cf32 build]# ompi_info | grep -i mult
  Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support:
yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)

Any ideas on this one?

On Mon, Feb 24, 2020 at 7:24 PM Nathan Hjelm via users <
users@lists.open-mpi.org> wrote:

> The error is from btl/vader. CMA is not functioning as expected. It might
> work if you set btl_vader_single_copy_mechanism=none
>
> Performance will suffer though. It would be worth understanding why
> process_vm_readv is failing.
>
> Can you send a simple reproducer?
>
> -Nathan
>
> On Feb 24, 2020, at 2:59 PM, Gabriel, Edgar via users <
> users@lists.open-mpi.org> wrote:
>
> 
>
> I am not an expert for the one-sided code in Open MPI, I wanted to comment
> briefly on the potential MPI -IO related item. As far as I can see, the
> error message
>
>
>
> “Read -1, expected 48, errno = 1”
>
> does not stem from MPI I/O, at least not from the ompio library. What file
> system did you use for these tests?
>
>
>
> Thanks
>
> Edgar
>
>
>
> *From:* users  *On Behalf Of *Matt
> Thompson via users
> *Sent:* Monday, February 24, 2020 1:20 PM
> *To:* users@lists.open-mpi.org
> *Cc:* Matt Thompson 
> *Subject:* [OMPI users] Help with One-Sided Communication: Works in Intel
> MPI, Fails in Open MPI
>
>
>
> All,
>
>
>
> My guess is this is a "I built Open MPI incorrectly" sort of issue, but
> I'm not sure how to fix it. Namely, I'm currently trying to get an MPI
> project's CI working on CircleCI using Open MPI to run some unit tests (on
> a single node, so need some oversubscribe). I can build everything just
> fine, but when I try to run, things just...blow up:
>
>
>
> [root@3796b115c961 build]# /opt/openmpi-4.0.2/bin/mpirun -np 18
> -oversubscribe /root/project/MAPL/build/bin/pfio_ctest_io.x -nc 6 -nsi 6
> -nso 6 -ngo 1 -ngi 1 -v T,U -s mpi
>  start app rank:   0
>  start app rank:   1
>  start app rank:   2
>  start app rank:   3
>  start app rank:   4
>  start app rank:   5
> [3796b115c961:03629] Read -1, expected 48, errno = 1
> [3796b115c961:03629] *** An error occurred in MPI_Get
> [3796b115c961:03629] *** reported by process [2144600065,12]
> [3796b115c961:03629] *** on 

Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI

2020-02-24 Thread Matt Thompson via users
On Mon, Feb 24, 2020 at 4:57 PM Gabriel, Edgar 
wrote:

> I am not an expert for the one-sided code in Open MPI, I wanted to comment
> briefly on the potential MPI -IO related item. As far as I can see, the
> error message
>
>
>
> “Read -1, expected 48, errno = 1”
>
> does not stem from MPI I/O, at least not from the ompio library. What file
> system did you use for these tests?
>

I am not sure. It was happening in a Docker image running on an AWS EC2
instance, so I guess whatever ebs is? I'm sort of a neophyte at both AWS
and Docker, so combine the two and...

Matt


Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI

2020-02-24 Thread Nathan Hjelm via users
The error is from btl/vader. CMA is not functioning as expected. It might work 
if you set btl_vader_single_copy_mechanism=none
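
That MCA parameter can be set either on the mpirun command line or in the 
environment; both spellings appear elsewhere in this thread, shown side by 
side here for reference:

# On the mpirun command line:
$ mpirun --mca btl_vader_single_copy_mechanism none -np 2 ./a.out

# Or via the environment, for anything launched afterwards:
$ export OMPI_MCA_btl_vader_single_copy_mechanism=none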

Performance will suffer though. It would be worth understanding why 
process_vm_readv is failing.

Can you send a simple reproducer?

-Nathan

> On Feb 24, 2020, at 2:59 PM, Gabriel, Edgar via users 
>  wrote:
> 
> 
> I am not an expert for the one-sided code in Open MPI, I wanted to comment 
> briefly on the potential MPI -IO related item. As far as I can see, the error 
> message
>  
> “Read -1, expected 48, errno = 1” 
> 
> does not stem from MPI I/O, at least not from the ompio library. What file 
> system did you use for these tests?
>  
> Thanks
> Edgar
>  
> From: users  On Behalf Of Matt Thompson via 
> users
> Sent: Monday, February 24, 2020 1:20 PM
> To: users@lists.open-mpi.org
> Cc: Matt Thompson 
> Subject: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, 
> Fails in Open MPI
>  
> All,
>  
> My guess is this is a "I built Open MPI incorrectly" sort of issue, but I'm 
> not sure how to fix it. Namely, I'm currently trying to get an MPI project's 
> CI working on CircleCI using Open MPI to run some unit tests (on a single 
> node, so need some oversubscribe). I can build everything just fine, but when 
> I try to run, things just...blow up:
>  
> [root@3796b115c961 build]# /opt/openmpi-4.0.2/bin/mpirun -np 18 
> -oversubscribe /root/project/MAPL/build/bin/pfio_ctest_io.x -nc 6 -nsi 6 -nso 
> 6 -ngo 1 -ngi 1 -v T,U -s mpi
>  start app rank:   0
>  start app rank:   1
>  start app rank:   2
>  start app rank:   3
>  start app rank:   4
>  start app rank:   5
> [3796b115c961:03629] Read -1, expected 48, errno = 1
> [3796b115c961:03629] *** An error occurred in MPI_Get
> [3796b115c961:03629] *** reported by process [2144600065,12]
> [3796b115c961:03629] *** on win rdma window 5
> [3796b115c961:03629] *** MPI_ERR_OTHER: known error not in list
> [3796b115c961:03629] *** MPI_ERRORS_ARE_FATAL (processes in this win will now 
> abort,
> [3796b115c961:03629] ***and potentially your MPI job)
>  
> I'm currently more concerned about the MPI_Get error, though I'm not sure 
> what that "Read -1, expected 48, errno = 1" bit is about (MPI-IO error?). Now 
> this code is fairly fancy MPI code, so I decided to try a simpler one. 
> Searched the internet and found an example program here:
>  
> https://software.intel.com/en-us/blogs/2014/08/06/one-sided-communication
>  
> and when I build and run with Intel MPI it works:
>  
> (1027)(master) $ mpirun -V
> Intel(R) MPI Library for Linux* OS, Version 2018 Update 4 Build 20180823 (id: 
> 18555)
> Copyright 2003-2018 Intel Corporation.
> (1028)(master) $ mpiicc rma_test.c
> (1029)(master) $ mpirun -np 2 ./a.out
> srun.slurm: cluster configuration lacks support for cpu binding
> Rank 0 running on borgj001
> Rank 1 running on borgj001
> Rank 0 sets data in the shared memory: 00 01 02 03
> Rank 1 sets data in the shared memory: 10 11 12 13
> Rank 0 gets data from the shared memory: 10 11 12 13
> Rank 1 gets data from the shared memory: 00 01 02 03
> Rank 0 has new data in the shared memory:Rank 1 has new data in the shared 
> memory: 10 11 12 13
>  00 01 02 03
>  
> So, I have some confidence it was written correctly. Now on the same system I 
> try with Open MPI (building with gcc, not Intel C):
>  
> (1032)(master) $ mpirun -V
> mpirun (Open MPI) 4.0.1
> 
> Report bugs to http://www.open-mpi.org/community/help/
> (1033)(master) $ mpicc rma_test.c
> (1034)(master) $ mpirun -np 2 ./a.out
> Rank 0 running on borgj001
> Rank 1 running on borgj001
> Rank 0 sets data in the shared memory: 00 01 02 03
> Rank 1 sets data in the shared memory: 10 11 12 13
> [borgj001:22668] *** An error occurred in MPI_Get
> [borgj001:22668] *** reported by process [2514223105,1]
> [borgj001:22668] *** on win rdma window 3
> [borgj001:22668] *** MPI_ERR_RMA_RANGE: invalid RMA address range
> [borgj001:22668] *** MPI_ERRORS_ARE_FATAL (processes in this win will now 
> abort,
> [borgj001:22668] ***and potentially your MPI job)
> [borgj001:22642] 1 more process has sent help message help-mpi-errors.txt / 
> mpi_errors_are_fatal
> [borgj001:22642] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
> help / error messages
>  
> This is a similar failure to above. Any ideas what I might be doing wrong 
> here? I don't doubt I'm missing something, but I'm not sure what. Open MPI 
> was built pretty boringly:
>  
> Configure command line: '--with-slurm' '--enable-shared' 
> '--disable-wrapper-rpath' '--disable-wrapper-runpath' 
> '--enable-mca-no-build=btl-usnic' '--prefix=...'
>  
> And I'm not sure if we need those disable-wrapper bits anymore, but long ago 
> we needed them, and so they've lived on in "how to build" READMEs until 
> something breaks. This btl-usnic is a bit unknown to me (this was built by 
> sysadmins on a cluster), but this is pretty close to how I build on my 
> 

Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI

2020-02-24 Thread Gabriel, Edgar via users
I am not an expert for the one-sided code in Open MPI, I wanted to comment 
briefly on the potential MPI -IO related item. As far as I can see, the error 
message

“Read -1, expected 48, errno = 1”

does not stem from MPI I/O, at least not from the ompio library. What file 
system did you use for these tests?

Thanks
Edgar

From: users  On Behalf Of Matt Thompson via 
users
Sent: Monday, February 24, 2020 1:20 PM
To: users@lists.open-mpi.org
Cc: Matt Thompson 
Subject: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, 
Fails in Open MPI

All,

My guess is this is a "I built Open MPI incorrectly" sort of issue, but I'm not 
sure how to fix it. Namely, I'm currently trying to get an MPI project's CI 
working on CircleCI using Open MPI to run some unit tests (on a single node, so 
need some oversubscribe). I can build everything just fine, but when I try to 
run, things just...blow up:

[root@3796b115c961 build]# /opt/openmpi-4.0.2/bin/mpirun -np 18 -oversubscribe 
/root/project/MAPL/build/bin/pfio_ctest_io.x -nc 6 -nsi 6 -nso 6 -ngo 1 -ngi 1 
-v T,U -s mpi
 start app rank:   0
 start app rank:   1
 start app rank:   2
 start app rank:   3
 start app rank:   4
 start app rank:   5
[3796b115c961:03629] Read -1, expected 48, errno = 1
[3796b115c961:03629] *** An error occurred in MPI_Get
[3796b115c961:03629] *** reported by process [2144600065,12]
[3796b115c961:03629] *** on win rdma window 5
[3796b115c961:03629] *** MPI_ERR_OTHER: known error not in list
[3796b115c961:03629] *** MPI_ERRORS_ARE_FATAL (processes in this win will now 
abort,
[3796b115c961:03629] ***and potentially your MPI job)

I'm currently more concerned about the MPI_Get error, though I'm not sure what 
that "Read -1, expected 48, errno = 1" bit is about (MPI-IO error?). Now this 
code is fairly fancy MPI code, so I decided to try a simpler one. Searched the 
internet and found an example program here:

https://software.intel.com/en-us/blogs/2014/08/06/one-sided-communication

and when I build and run with Intel MPI it works:

(1027)(master) $ mpirun -V
Intel(R) MPI Library for Linux* OS, Version 2018 Update 4 Build 20180823 (id: 
18555)
Copyright 2003-2018 Intel Corporation.
(1028)(master) $ mpiicc rma_test.c
(1029)(master) $ mpirun -np 2 ./a.out
srun.slurm: cluster configuration lacks support for cpu binding
Rank 0 running on borgj001
Rank 1 running on borgj001
Rank 0 sets data in the shared memory: 00 01 02 03
Rank 1 sets data in the shared memory: 10 11 12 13
Rank 0 gets data from the shared memory: 10 11 12 13
Rank 1 gets data from the shared memory: 00 01 02 03
Rank 0 has new data in the shared memory:Rank 1 has new data in the shared 
memory: 10 11 12 13
 00 01 02 03

So, I have some confidence it was written correctly. Now on the same system I 
try with Open MPI (building with gcc, not Intel C):

(1032)(master) $ mpirun -V
mpirun (Open MPI) 4.0.1

Report bugs to http://www.open-mpi.org/community/help/
(1033)(master) $ mpicc rma_test.c
(1034)(master) $ mpirun -np 2 ./a.out
Rank 0 running on borgj001
Rank 1 running on borgj001
Rank 0 sets data in the shared memory: 00 01 02 03
Rank 1 sets data in the shared memory: 10 11 12 13
[borgj001:22668] *** An error occurred in MPI_Get
[borgj001:22668] *** reported by process [2514223105,1]
[borgj001:22668] *** on win rdma window 3
[borgj001:22668] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[borgj001:22668] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[borgj001:22668] ***and potentially your MPI job)
[borgj001:22642] 1 more process has sent help message help-mpi-errors.txt / 
mpi_errors_are_fatal
[borgj001:22642] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
help / error messages

This is a similar failure to above. Any ideas what I might be doing wrong here? 
I don't doubt I'm missing something, but I'm not sure what. Open MPI was built 
pretty boringly:

Configure command line: '--with-slurm' '--enable-shared' 
'--disable-wrapper-rpath' '--disable-wrapper-runpath' 
'--enable-mca-no-build=btl-usnic' '--prefix=...'

And I'm not sure if we need those disable-wrapper bits anymore, but long ago we 
needed them, and so they've lived on in "how to build" READMEs until something 
breaks. This btl-usnic is a bit unknown to me (this was built by sysadmins on a 
cluster), but this is pretty close to how I build on my desktop and it has the 
same issue.

Any ideas from the experts?

--
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna 
Rampton


Re: [OMPI users] HELP: openmpi is not using the specified infiniband interface !!

2020-01-14 Thread George Bosilca via users
According to the error message you are using MPICH not Open MPI.

  George.
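
A quick sketch for checking which implementation a given mpirun belongs to 
(the install prefix is the one from the original message; the rest are 
standard commands, assuming ompi_info was installed under that prefix):

$ which mpirun                    # which MPI's mpirun is first on your PATH?
$ mpirun --version                # Open MPI prints "mpirun (Open MPI) x.y.z"
$ /opt/mpi/openmpi_intel-2.1.1/bin/mpirun --version
$ /opt/mpi/openmpi_intel-2.1.1/bin/ompi_info | grep btl    # lists the built btl components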


On Tue, Jan 14, 2020 at 5:53 PM SOPORTE MODEMAT via users <
users@lists.open-mpi.org> wrote:

> Hello everyone.
>
>
>
> I would like somebody to help me figure out how I can make Open MPI
> use the InfiniBand interface that I specify with the command:
>
>
>
> /opt/mpi/openmpi_intel-2.1.1/bin/mpirun --mca btl self,openib python
> mpi_hola.py
>
>
>
> But when I print the hostname and the IP address of the interface from the
> Python script, it shows the ethernet interface:
>
>
>
> # Ejemplo de mpi4py
>
> # Funcion 'Hola mundo'
>
>
>
> from mpi4py import MPI
>
> import socket
>
>
>
> ##print(socket.gethostname())
>
>
>
> comm = MPI.COMM_WORLD
>
> rank = comm.Get_rank()
>
> size = comm.Get_size()
>
>
>
> print('Hola mundo')
>
> print('Proceso {} de {}'.format(rank, size))
>
>
>
> host_name = socket.gethostname()
>
> host_ip = socket.gethostbyname(host_name)
>
> print("Hostname :  ",host_name)
>
> print("IP : ",host_ip)
>
>
>
> The output is:
>
>
>
>
>
> Hola mundo
>
> Proceso 0 de 1
>
> Hostname :   apollo-2.private
>
> IP :  10.50.1.253
>
> Hostname :   apollo-2.private
>
> IP :  10.50.1.253
>
> Hostname :   apollo-2.private
>
> IP :  10.50.1.253
>
> Hostname :   apollo-2.private
>
> IP :  10.50.1.253
>
>
>
> But the node has already the infiniband interface configured on ib0 with
> another network.
>
>
>
> I would like you to give me some advice to make this script use the
> infiniband interface that is specified via mpi.
>
>
>
> When I run:
>
>
>
> mpirun --mca btl self,openib ./a.out
>
>
>
>
>
> I get this error that confirms that MPI is using the ethernet interface:
>
>
>
> [mpiexec@apollo-2.private] match_arg (./utils/args/args.c:122):
> unrecognized argument mca
>
> [mpiexec@apollo-2.private] HYDU_parse_array (./utils/args/args.c:140):
> argument matching returned error
>
> [mpiexec@apollo-2.private] parse_args (./ui/mpich/utils.c:1387): error
> parsing input array
>
> [mpiexec@apollo-2.private] HYD_uii_mpx_get_parameters
> (./ui/mpich/utils.c:1438): unable to parse user arguments
>
>
>
> Usage: ./mpiexec [global opts] [exec1 local opts] : [exec2 local opts] :
> ...
>
>
>
> Global options (passed to all executables):
>
>
>
>
>
> Additional Information:
>
>
>
> ll /sys/class/infiniband: mlx5_0
>
>
>
> /sys/class/net/
>
> total 0
>
> lrwxrwxrwx 1 root root 0 Jan 14 12:03 eno49 ->
> ../../devices/pci:00/:00:02.0/:07:00.0/net/eno49
>
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 eno50 ->
> ../../devices/pci:00/:00:02.0/:07:00.1/net/eno50
>
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 eno51 ->
> ../../devices/pci:00/:00:02.0/:07:00.2/net/eno51
>
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 eno52 ->
> ../../devices/pci:00/:00:02.0/:07:00.3/net/eno52
>
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 ib0 ->
> ../../devices/pci:00/:00:01.0/:06:00.0/net/ib0
>
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 ib1 ->
> ../../devices/pci:00/:00:01.0/:06:00.0/net/ib1
>
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 lo -> ../../devices/virtual/net/lo
>
>
>
>
>
> The operating system is Linux Centos 7.
>
>
>
>
>
> Thank you in advance for your help.
>
>
>
> Kind regards.
>
>
>
> Soporte.modemat
>


Re: [OMPI users] HELP: openmpi is not using the specified infiniband interface !!

2020-01-14 Thread Gilles Gouaillardet via users
Soporte,

The error message is from MPICH!

If you intend to use Open MPI, fix your environment first

Cheers,

Gilles


Sent from my iPod

> On Jan 15, 2020, at 7:53, SOPORTE MODEMAT via users 
>  wrote:
> 
> Hello everyone.
>  
> I would like somebody to help me figure out how I can make Open MPI 
> use the InfiniBand interface that I specify with the command:
>  
> /opt/mpi/openmpi_intel-2.1.1/bin/mpirun --mca btl self,openib python 
> mpi_hola.py
>  
> But when I print the hostname and the IP address of the interface from the 
> Python script, it shows the ethernet interface:
>  
> # Ejemplo de mpi4py
> # Funcion 'Hola mundo'
>  
> from mpi4py import MPI
> import socket
>  
> ##print(socket.gethostname())
>  
> comm = MPI.COMM_WORLD
> rank = comm.Get_rank()
> size = comm.Get_size()
>  
> print('Hola mundo')
> print('Proceso {} de {}'.format(rank, size))
>  
> host_name = socket.gethostname()
> host_ip = socket.gethostbyname(host_name)
> print("Hostname :  ",host_name)
> print("IP : ",host_ip)
>  
> The output is:
>  
>  
> Hola mundo
> Proceso 0 de 1
> Hostname :   apollo-2.private
> IP :  10.50.1.253
> Hostname :   apollo-2.private
> IP :  10.50.1.253
> Hostname :   apollo-2.private
> IP :  10.50.1.253
> Hostname :   apollo-2.private
> IP :  10.50.1.253
>  
> But the node has already the infiniband interface configured on ib0 with 
> another network.
>  
> I would like you to give me some advice to make this script use the 
> infiniband interface that is specified via mpi.
>  
> When I run:
>  
> mpirun --mca btl self,openib ./a.out
>  
>  
> I get this error that confirms that MPI is using the ethernet interface:
>  
> [mpiexec@apollo-2.private] match_arg (./utils/args/args.c:122): unrecognized 
> argument mca
> [mpiexec@apollo-2.private] HYDU_parse_array (./utils/args/args.c:140): 
> argument matching returned error
> [mpiexec@apollo-2.private] parse_args (./ui/mpich/utils.c:1387): error 
> parsing input array
> [mpiexec@apollo-2.private] HYD_uii_mpx_get_parameters 
> (./ui/mpich/utils.c:1438): unable to parse user arguments
>  
> Usage: ./mpiexec [global opts] [exec1 local opts] : [exec2 local opts] : ...
>  
> Global options (passed to all executables):
>  
>  
> Additional Information:
>  
> ll /sys/class/infiniband: mlx5_0
>  
> /sys/class/net/
> total 0
> lrwxrwxrwx 1 root root 0 Jan 14 12:03 eno49 -> 
> ../../devices/pci:00/:00:02.0/:07:00.0/net/eno49
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 eno50 -> 
> ../../devices/pci:00/:00:02.0/:07:00.1/net/eno50
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 eno51 -> 
> ../../devices/pci:00/:00:02.0/:07:00.2/net/eno51
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 eno52 -> 
> ../../devices/pci:00/:00:02.0/:07:00.3/net/eno52
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 ib0 -> 
> ../../devices/pci:00/:00:01.0/:06:00.0/net/ib0
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 ib1 -> 
> ../../devices/pci:00/:00:01.0/:06:00.0/net/ib1
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 lo -> ../../devices/virtual/net/lo
>  
>  
> The operating system is Linux Centos 7.
>  
>  
> Thank you in advance for your help.
>  
> Kind regards.
>  
> Soporte.modemat


Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-23 Thread Matt Thompson
_MAC, et al,

Things are looking up. By specifying --with-verbs=no, I can run helloworld.
But in a new-for-me wrinkle, I can only run on *more* than one node. Not sure
I've ever seen that. Using 40-core nodes,
this:

mpirun -np 41 ./helloWorld.mpi3.SLES12.OMPI400.exe

works, and -np 40 fails:

(1027)(master) $ mpirun -np 40 ./helloWorld.mpi3.SLES12.OMPI400.exe
[borga033:05598] *** An error occurred in MPI_Barrier
[borga033:05598] *** reported by process [140735567101953,140733193388034]
[borga033:05598] *** on communicator MPI_COMM_WORLD
[borga033:05598] *** MPI_ERR_OTHER: known error not in list
[borga033:05598] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[borga033:05598] ***and potentially your MPI job)
Compiler Version: Intel(R) Fortran Intel(R) 64 Compiler for applications
running on Intel(R) 64, Version 18.0.5.274 Build 20180823
MPI Version: 3.1
MPI Library Version: Open MPI v4.0.0, package: Open MPI mathomp4@discover21
Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018
forrtl: error (78): process killed (SIGTERM)
Image  PCRoutineLineSource
helloWorld.mpi3.S  0040A38E  for__signal_handl Unknown  Unknown
libpthread-2.22.s  2B9CCB20  Unknown   Unknown  Unknown
libpthread-2.22.s  2B9CC3ED  __nanosleep   Unknown  Unknown
libopen-rte.so.40  2C3C5854  orte_show_help_no Unknown  Unknown
libopen-rte.so.40  2C3C5595  orte_show_helpUnknown  Unknown
libmpi.so.40.20.0  2B3BADC5  ompi_mpi_errors_a Unknown  Unknown
libmpi.so.40.20.0  2B3B99D9  ompi_errhandler_i Unknown  Unknown
libmpi.so.40.20.0  2B3E4586  MPI_Barrier   Unknown  Unknown
libmpi_mpifh.so.4  2B15EE53  MPI_Barrier_f08   Unknown  Unknown
libmpi_usempif08.  2ACE7742  mpi_barrier_f08_  Unknown  Unknown
helloWorld.mpi3.S  0040939F  Unknown   Unknown  Unknown
helloWorld.mpi3.S  0040915E  Unknown   Unknown  Unknown
libc-2.22.so   2BBF96D5  __libc_start_main Unknown  Unknown
helloWorld.mpi3.S  00409069  Unknown   Unknown  Unknown

So, I'm getting closer but I have to admit I've never built an MPI stack
before where running on a single node was the broken bit!

On Tue, Jan 22, 2019 at 1:31 PM Cabral, Matias A 
wrote:

> Hi Matt,
>
>
>
> There seem to be two different issues here:
>
> a)  The warning message comes from the openib btl. Given that
> Omnipath has verbs API and you have the necessary libraries in your system,
> openib btl finds itself as a potential transport and prints the warning
> during its init (openib btl is on its way to deprecation). You may try to
> explicitly ask for vader btl given you are running on shared mem: -mca btl
> self,vader -mca pml ob1. Or better, explicitly build without openib:
> ./configure --with-verbs=no …
>
> b)  Not my field of expertise, but you may be having some conflict
> with the external components you are using:
> --with-pmix=/usr/nlocal/pmix/2.1 --with-libevent=/usr . You may try not
> specifying these and using the ones provided by OMPI.
>
>
>
> _MAC
>
>
>
> *From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of *Matt
> Thompson
> *Sent:* Tuesday, January 22, 2019 6:04 AM
> *To:* Open MPI Users 
> *Subject:* Re: [OMPI users] Help Getting Started with Open MPI and PMIx
> and UCX
>
>
>
> Well,
>
>
>
> By turning off UCX compilation per Howard, things get a bit better in that
> something happens! It's not a good something, as it seems to die with an
> infiniband error. As this is an Omnipath system, is OpenMPI perhaps seeing
> libverbs somewhere and compiling it in? To wit:
>
>
>
> (1006)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
>
> --
>
> By default, for Open MPI 4.0 and later, infiniband ports on a device
>
> are not used by default.  The intent is to use UCX for these devices.
>
> You can override this policy by setting the btl_openib_allow_ib MCA
> parameter
>
> to true.
>
>
>
>   Local host:  borgc129
>
>   Local adapter:   hfi1_0
>
>   Local port:  1
>
>
>
> --
>
> --
>
> WARNING: There was an error initializing an OpenFabrics device.
>
>
>
>   Local host:   borgc129
>
>   Local device: hfi1_0
>
> --
>
> Compiler Version: Intel(R) Fortran Intel(R) 64 Compiler for applications
> ru

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-22 Thread Cabral, Matias A
Hi Matt,

There seem to be two different issues here:

a)  The warning message comes from the openib btl. Given that Omnipath has 
verbs API and you have the necessary libraries in your system, openib btl finds 
itself as a potential transport and prints the warning during its init (openib 
btl is on its way to deprecation). You may try to explicitly ask for vader btl 
given you are running on shared mem: -mca btl self,vader -mca pml ob1. Or 
better, explicitly build without openib: ./configure --with-verbs=no …

b)  Not my field of expertise, but you may be having some conflict with the 
external components you are using: --with-pmix=/usr/nlocal/pmix/2.1 
--with-libevent=/usr . You may try not specifying these and using the ones 
provided by OMPI.

_MAC

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Matt Thompson
Sent: Tuesday, January 22, 2019 6:04 AM
To: Open MPI Users 
Subject: Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

Well,

By turning off UCX compilation per Howard, things get a bit better in that 
something happens! It's not a good something, as it seems to die with an 
infiniband error. As this is an Omnipath system, is OpenMPI perhaps seeing 
libverbs somewhere and compiling it in? To wit:

(1006)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
--
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:  borgc129
  Local adapter:   hfi1_0
  Local port:  1

--
--
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   borgc129
  Local device: hfi1_0
--
Compiler Version: Intel(R) Fortran Intel(R) 64 Compiler for applications 
running on Intel(R) 64, Version 18.0.5.274 Build 20180823
MPI Version: 3.1
MPI Library Version: Open MPI v4.0.0, package: Open MPI mathomp4@discover23 
Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018
[borgc129:260830] *** An error occurred in MPI_Barrier
[borgc129:260830] *** reported by process [140736833716225,46909632806913]
[borgc129:260830] *** on communicator MPI_COMM_WORLD
[borgc129:260830] *** MPI_ERR_OTHER: known error not in list
[borgc129:260830] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will 
now abort,
[borgc129:260830] ***and potentially your MPI job)
forrtl: error (78): process killed (SIGTERM)
Image  PCRoutineLineSource
helloWorld.mpi3.S  0040A38E  for__signal_handl Unknown  Unknown
libpthread-2.22.s  2B9CCB20  Unknown   Unknown  Unknown
libpthread-2.22.s  2B9C90CD  pthread_cond_wait Unknown  Unknown
libpmix.so.2.1.11  2AAAB1D780A1  PMIx_AbortUnknown  Unknown
mca_pmix_ext2x.so  2AAAB1B3AA75  ext2x_abort   Unknown  Unknown
mca_ess_pmi.so 2AAAB1724BC0  Unknown   Unknown  Unknown
libopen-rte.so.40  2C3E941C  orte_errmgr_base_ Unknown  Unknown
mca_errmgr_defaul  2AAABC401668  Unknown   Unknown  Unknown
libmpi.so.40.20.0  2B3CDBC4  ompi_mpi_abortUnknown  Unknown
libmpi.so.40.20.0  2B3BB1EF  ompi_mpi_errors_a Unknown  Unknown
libmpi.so.40.20.0  2B3B99C9  ompi_errhandler_i Unknown  Unknown
libmpi.so.40.20.0  2B3E4576  MPI_Barrier   Unknown  Unknown
libmpi_mpifh.so.4  2B15EE53  MPI_Barrier_f08   Unknown  Unknown
libmpi_usempif08.  2ACE7732  mpi_barrier_f08_  Unknown  Unknown
helloWorld.mpi3.S  0040939F  Unknown   Unknown  Unknown
helloWorld.mpi3.S  0040915E  Unknown   Unknown  Unknown
libc-2.22.so   2BBF96D5  __libc_start_main Unknown  Unknown
helloWorld.mpi3.S  00409069  Unknown   Unknown  Unknown

On Sun, Jan 20, 2019 at 4:19 PM Howard Pritchard <hpprit...@gmail.com> wrote:
Hi Matt

Definitely do not include the ucx option for an omnipath cluster.  Actually, if 
you accidentally installed ucx in its default location on the system, switch 
to this config option

--with-ucx=no

Otherwise you will hit

https://github.com/openucx/ucx/issues/750

Howard


Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote on Sat, 19 Jan 2019 at 18:41:
Matt,

There are two ways of using PMIx

- if you use mpirun, then the MPI app (e.g. the PMIx client) will talk
to mpirun and orted daemons (e.g. the PMIx server)
- if you use SLURM srun, then the MPI app will directly talk to the
PMIx server

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-22 Thread Matt Thompson
Well,

By turning off UCX compilation per Howard, things get a bit better in that
something happens! It's not a good something, as it seems to die with an
infiniband error. As this is an Omnipath system, is OpenMPI perhaps seeing
libverbs somewhere and compiling it in? To wit:

(1006)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
--
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA
parameter
to true.

  Local host:  borgc129
  Local adapter:   hfi1_0
  Local port:  1

--
--
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   borgc129
  Local device: hfi1_0
--
Compiler Version: Intel(R) Fortran Intel(R) 64 Compiler for applications
running on Intel(R) 64, Version 18.0.5.274 Build 20180823
MPI Version: 3.1
MPI Library Version: Open MPI v4.0.0, package: Open MPI mathomp4@discover23
Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018
[borgc129:260830] *** An error occurred in MPI_Barrier
[borgc129:260830] *** reported by process [140736833716225,46909632806913]
[borgc129:260830] *** on communicator MPI_COMM_WORLD
[borgc129:260830] *** MPI_ERR_OTHER: known error not in list
[borgc129:260830] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[borgc129:260830] ***and potentially your MPI job)
forrtl: error (78): process killed (SIGTERM)
Image  PCRoutineLineSource
helloWorld.mpi3.S  0040A38E  for__signal_handl Unknown  Unknown
libpthread-2.22.s  2B9CCB20  Unknown   Unknown  Unknown
libpthread-2.22.s  2B9C90CD  pthread_cond_wait Unknown  Unknown
libpmix.so.2.1.11  2AAAB1D780A1  PMIx_AbortUnknown  Unknown
mca_pmix_ext2x.so  2AAAB1B3AA75  ext2x_abort   Unknown  Unknown
mca_ess_pmi.so 2AAAB1724BC0  Unknown   Unknown  Unknown
libopen-rte.so.40  2C3E941C  orte_errmgr_base_ Unknown  Unknown
mca_errmgr_defaul  2AAABC401668  Unknown   Unknown  Unknown
libmpi.so.40.20.0  2B3CDBC4  ompi_mpi_abortUnknown  Unknown
libmpi.so.40.20.0  2B3BB1EF  ompi_mpi_errors_a Unknown  Unknown
libmpi.so.40.20.0  2B3B99C9  ompi_errhandler_i Unknown  Unknown
libmpi.so.40.20.0  2B3E4576  MPI_Barrier   Unknown  Unknown
libmpi_mpifh.so.4  2B15EE53  MPI_Barrier_f08   Unknown  Unknown
libmpi_usempif08.  2ACE7732  mpi_barrier_f08_  Unknown  Unknown
helloWorld.mpi3.S  0040939F  Unknown   Unknown  Unknown
helloWorld.mpi3.S  0040915E  Unknown   Unknown  Unknown
libc-2.22.so   2BBF96D5  __libc_start_main Unknown  Unknown
helloWorld.mpi3.S  00409069  Unknown   Unknown  Unknown

On Sun, Jan 20, 2019 at 4:19 PM Howard Pritchard 
wrote:

> Hi Matt
>
> Definitely do not include the ucx option for an omnipath cluster.
> Actually, if you accidentally installed ucx in its default location on
> the system, switch to this config option
>
> --with-ucx=no
>
> Otherwise you will hit
>
> https://github.com/openucx/ucx/issues/750
>
> Howard
>
>
> Gilles Gouaillardet  schrieb am Sa. 19.
> Jan. 2019 um 18:41:
>
>> Matt,
>>
>> There are two ways of using PMIx
>>
>> - if you use mpirun, then the MPI app (e.g. the PMIx client) will talk
>> to mpirun and orted daemons (e.g. the PMIx server)
>> - if you use SLURM srun, then the MPI app will directly talk to the
>> PMIx server provided by SLURM. (note you might have to srun
>> --mpi=pmix_v2 or something)
>>
>> In the former case, it does not matter whether you use the embedded or
>> external PMIx.
>> In the latter case, Open MPI and SLURM have to use compatible PMIx
>> libraries, and you can either check the cross-version compatibility
>> matrix,
>> or build Open MPI with the same PMIx used by SLURM to be on the safe
>> side (not a bad idea IMHO).
>>
>>
>> Regarding the hang, I suggest you try different things
>> - use mpirun in a SLURM job (e.g. sbatch instead of salloc so mpirun
>> runs on a compute node rather than on a frontend node)
>> - try something even simpler such as mpirun hostname (both with sbatch
>> and salloc)
>> - explicitly specify the network to be used for the wire-up. you can
>> for example mpirun --mca oob_tcp_if_include 192.168.0.0/24 if this is
>> the network subnet by which all the nodes (e.g. compute nodes and
>> frontend node if you use salloc) communicate.
>>
>>
>> Cheers,
>>
>> Gilles
>>
>> On Sat, Jan 19, 2019 at 3:31 AM Matt Thompson  

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-20 Thread Howard Pritchard
Hi Matt

Definitely do not include the ucx option for an omnipath cluster.  Actually,
if you accidentally installed ucx in its default location on the
system, switch to this config option

--with-ucx=no

Otherwise you will hit

https://github.com/openucx/ucx/issues/750

Howard
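
Putting the suggestions from this thread together, a configure sketch for an 
Omni-Path cluster might look like the following (the install prefix and make 
job count are illustrative; the flags and compilers are the ones already 
mentioned in the thread):

$ ./configure --prefix=$HOME/openmpi-4.0.0 \
      --with-slurm --with-psm2 \
      --with-ucx=no --with-verbs=no \
      CC=icc CXX=icpc FC=ifort
$ make -j 8 && make install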


Gilles Gouaillardet  wrote on Sat, 19 Jan 2019 at 18:41:

> Matt,
>
> There are two ways of using PMIx
>
> - if you use mpirun, then the MPI app (e.g. the PMIx client) will talk
> to mpirun and orted daemons (e.g. the PMIx server)
> - if you use SLURM srun, then the MPI app will directly talk to the
> PMIx server provided by SLURM. (note you might have to srun
> --mpi=pmix_v2 or something)
>
> In the former case, it does not matter whether you use the embedded or
> external PMIx.
> In the latter case, Open MPI and SLURM have to use compatible PMIx
> libraries, and you can either check the cross-version compatibility
> matrix,
> or build Open MPI with the same PMIx used by SLURM to be on the safe
> side (not a bad idea IMHO).
>
>
> Regarding the hang, I suggest you try different things
> - use mpirun in a SLURM job (e.g. sbatch instead of salloc so mpirun
> runs on a compute node rather than on a frontend node)
> - try something even simpler such as mpirun hostname (both with sbatch
> and salloc)
> - explicitly specify the network to be used for the wire-up. you can
> for example mpirun --mca oob_tcp_if_include 192.168.0.0/24 if this is
> the network subnet by which all the nodes (e.g. compute nodes and
> frontend node if you use salloc) communicate.
>
>
> Cheers,
>
> Gilles
>
> On Sat, Jan 19, 2019 at 3:31 AM Matt Thompson  wrote:
> >
> > On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users <
> users@lists.open-mpi.org> wrote:
> >>
> >> On Jan 18, 2019, at 12:43 PM, Matt Thompson  wrote:
> >> >
> >> > With some help, I managed to build an Open MPI 4.0.0 with:
> >>
> >> We can discuss each of these params to let you know what they are.
> >>
> >> > ./configure --disable-wrapper-rpath --disable-wrapper-runpath
> >>
> >> Did you have a reason for disabling these?  They're generally good
> things.  What they do is add linker flags to the wrapper compilers (i.e.,
> mpicc and friends) that basically put a default path to find libraries at
> run time (that can/will in most cases override LD_LIBRARY_PATH -- but you
> can override these linked-in-default-paths if you want/need to).
> >
> >
> > I've had these in my Open MPI builds for a while now. The reason was one
> of the libraries I need for the climate model I work on went nuts if both
> of them weren't there. It was originally the rpath one but then eventually
> (Open MPI 3?) I had to add the runpath one. But I have been updating the
> libraries more aggressively recently (due to OS upgrades) so it's possible
> this is no longer needed.
> >
> >>
> >>
> >> > --with-psm2
> >>
> >> Ensure that Open MPI can include support for the PSM2 library, and
> abort configure if it cannot.
> >>
> >> > --with-slurm
> >>
> >> Ensure that Open MPI can include support for SLURM, and abort configure
> if it cannot.
> >>
> >> > --enable-mpi1-compatibility
> >>
> >> Add support for MPI_Address and other MPI-1 functions that have since
> been deleted from the MPI 3.x specification.
> >>
> >> > --with-ucx
> >>
> >> Ensure that Open MPI can include support for UCX, and abort configure
> if it cannot.
> >>
> >> > --with-pmix=/usr/nlocal/pmix/2.1
> >>
> >> Tells Open MPI to use the PMIx that is installed at
> /usr/nlocal/pmix/2.1 (instead of using the PMIx that is bundled internally
> to Open MPI's source code tree/expanded tarball).
> >>
> >> Unless you have a reason to use the external PMIx, the internal/bundled
> PMIx is usually sufficient.
> >
> >
> > Ah. I did not know that. I figured if our SLURM was built linked to a
> specific PMIx v2 that I should build Open MPI with the same PMIx. I'll
> build an Open MPI 4 without specifying this.
> >
> >>
> >>
> >> > --with-libevent=/usr
> >>
> >> Same as previous; change "pmix" to "libevent" (i.e., use the external
> libevent instead of the bundled libevent).
> >>
> >> > CC=icc CXX=icpc FC=ifort
> >>
> >> Specify the exact compilers to use.
> >>
> >> > The MPI 1 is because I need to build HDF5 eventually and I added psm2
> because it's an Omnipath cluster. The libevent was probably a red herring
> as libevent-devel wasn't installed on the system. It was eventually, and I
> just didn't remove the flag. And I saw no errors in the build!
> >>
> >> Might as well remove the --with-libevent if you don't need it.
> >>
> >> > However, I seem to have built an Open MPI that doesn't work:
> >> >
> >> > (1099)(master) $ mpirun --version
> >> > mpirun (Open MPI) 4.0.0
> >> >
> >> > Report bugs to http://www.open-mpi.org/community/help/
> >> > (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
> >> >
> >> > It just sits there...forever. Can the gurus here help me figure out
> what I managed to break? Perhaps I added too much to my configure 

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-19 Thread Gilles Gouaillardet
Matt,

There are two ways of using PMIx

- if you use mpirun, then the MPI app (e.g. the PMIx client) will talk
to mpirun and orted daemons (e.g. the PMIx server)
- if you use SLURM srun, then the MPI app will directly talk to the
PMIx server provided by SLURM. (note you might have to srun
--mpi=pmix_v2 or something)

In the former case, it does not matter whether you use the embedded or
external PMIx.
In the latter case, Open MPI and SLURM have to use compatible PMIx
libraries, and you can either check the cross-version compatibility
matrix,
or build Open MPI with the same PMIx used by SLURM to be on the safe
side (not a bad idea IMHO).


Regarding the hang, I suggest you try different things
- use mpirun in a SLURM job (e.g. sbatch instead of salloc so mpirun
runs on a compute node rather than on a frontend node)
- try something even simpler such as mpirun hostname (both with sbatch
and salloc)
- explicitly specify the network to be used for the wire-up. you can
for example mpirun --mca oob_tcp_if_include 192.168.0.0/24 if this is
the network subnet by which all the nodes (e.g. compute nodes and
frontend node if you use salloc) communicate.
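
A minimal sbatch sketch of the first two suggestions, saved as, say, 
test_ompi.sbatch and submitted with "sbatch test_ompi.sbatch" (node and task 
counts are illustrative; the binary name is the one from this thread):

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=4

mpirun hostname                     # the even-simpler sanity check
mpirun -np 8 ./helloWorld.mpi3.SLES12.OMPI400.exe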


Cheers,

Gilles

On Sat, Jan 19, 2019 at 3:31 AM Matt Thompson  wrote:
>
> On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users 
>  wrote:
>>
>> On Jan 18, 2019, at 12:43 PM, Matt Thompson  wrote:
>> >
>> > With some help, I managed to build an Open MPI 4.0.0 with:
>>
>> We can discuss each of these params to let you know what they are.
>>
>> > ./configure --disable-wrapper-rpath --disable-wrapper-runpath
>>
>> Did you have a reason for disabling these?  They're generally good things.  
>> What they do is add linker flags to the wrapper compilers (i.e., mpicc and 
>> friends) that basically put a default path to find libraries at run time 
>> (that can/will in most cases override LD_LIBRARY_PATH -- but you can 
>> override these linked-in-default-paths if you want/need to).
>
>
> I've had these in my Open MPI builds for a while now. The reason was one of 
> the libraries I need for the climate model I work on went nuts if both of 
> them weren't there. It was originally the rpath one but then eventually (Open 
> MPI 3?) I had to add the runpath one. But I have been updating the libraries 
> more aggressively recently (due to OS upgrades) so it's possible this is no 
> longer needed.
>
>>
>>
>> > --with-psm2
>>
>> Ensure that Open MPI can include support for the PSM2 library, and abort 
>> configure if it cannot.
>>
>> > --with-slurm
>>
>> Ensure that Open MPI can include support for SLURM, and abort configure if 
>> it cannot.
>>
>> > --enable-mpi1-compatibility
>>
>> Add support for MPI_Address and other MPI-1 functions that have since been 
>> deleted from the MPI 3.x specification.
>>
>> > --with-ucx
>>
>> Ensure that Open MPI can include support for UCX, and abort configure if it 
>> cannot.
>>
>> > --with-pmix=/usr/nlocal/pmix/2.1
>>
>> Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1 
>> (instead of using the PMIx that is bundled internally to Open MPI's source 
>> code tree/expanded tarball).
>>
>> Unless you have a reason to use the external PMIx, the internal/bundled PMIx 
>> is usually sufficient.
>
>
> Ah. I did not know that. I figured if our SLURM was built linked to a 
> specific PMIx v2 that I should build Open MPI with the same PMIx. I'll build 
> an Open MPI 4 without specifying this.
>
>>
>>
>> > --with-libevent=/usr
>>
>> Same as previous; change "pmix" to "libevent" (i.e., use the external 
>> libevent instead of the bundled libevent).
>>
>> > CC=icc CXX=icpc FC=ifort
>>
>> Specify the exact compilers to use.
>>
>> > The MPI 1 is because I need to build HDF5 eventually and I added psm2 
>> > because it's an Omnipath cluster. The libevent was probably a red herring 
>> > as libevent-devel wasn't installed on the system. It was eventually, and I 
>> > just didn't remove the flag. And I saw no errors in the build!
>>
>> Might as well remove the --with-libevent if you don't need it.
>>
>> > However, I seem to have built an Open MPI that doesn't work:
>> >
>> > (1099)(master) $ mpirun --version
>> > mpirun (Open MPI) 4.0.0
>> >
>> > Report bugs to http://www.open-mpi.org/community/help/
>> > (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
>> >
>> > It just sits there...forever. Can the gurus here help me figure out what I 
>> > managed to break? Perhaps I added too much to my configure line? Not 
>> > enough?
>>
>> There could be a few things going on here.
>>
>> Are you running inside a SLURM job?  E.g., in a "salloc" job, or in an 
>> "sbatch" script?
>
>
> I have salloc'd 8 nodes of 40 cores each. Intel MPI 18 and 19 work just fine 
> (as you'd hope on an Omnipath cluster), but for some reason Open MPI is 
> twitchy on this cluster. I once managed to get Open MPI 3.0.1 working (a few 
> months ago), and it had some interesting startup scaling I liked (slow at low 
> core count, but getting 

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-18 Thread Cabral, Matias A
Hi Matt,

Few comments/questions:

-  If your cluster has Omni-Path, you won’t need UCX. Instead you can 
run using PSM2, or alternatively OFI (a.k.a. Libfabric)

-  With the command you shared below (4 ranks on the local node), (I
think) a shared-memory transport is being selected (vader?). So, if the job is not
starting, this seems to be a runtime issue rather than a transport one… PMIx? SLURM?
Thanks
_MAC


From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Matt Thompson
Sent: Friday, January 18, 2019 10:27 AM
To: Open MPI Users 
Subject: Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users 
<users@lists.open-mpi.org> wrote:
On Jan 18, 2019, at 12:43 PM, Matt Thompson 
<fort...@gmail.com> wrote:
>
> With some help, I managed to build an Open MPI 4.0.0 with:

We can discuss each of these params to let you know what they are.

> ./configure --disable-wrapper-rpath --disable-wrapper-runpath

Did you have a reason for disabling these?  They're generally good things.  
What they do is add linker flags to the wrapper compilers (i.e., mpicc and 
friends) that basically put a default path to find libraries at run time (that 
can/will in most cases override LD_LIBRARY_PATH -- but you can override these 
linked-in-default-paths if you want/need to).

I've had these in my Open MPI builds for a while now. The reason was one of the 
libraries I need for the climate model I work on went nuts if both of them 
weren't there. It was originally the rpath one but then eventually (Open MPI 
3?) I had to add the runpath one. But I have been updating the libraries more 
aggressively recently (due to OS upgrades) so it's possible this is no longer 
needed.


> --with-psm2

Ensure that Open MPI can include support for the PSM2 library, and abort 
configure if it cannot.

> --with-slurm

Ensure that Open MPI can include support for SLURM, and abort configure if it 
cannot.

> --enable-mpi1-compatibility

Add support for MPI_Address and other MPI-1 functions that have since been 
deleted from the MPI 3.x specification.

> --with-ucx

Ensure that Open MPI can include support for UCX, and abort configure if it 
cannot.

> --with-pmix=/usr/nlocal/pmix/2.1

Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1 
(instead of using the PMIx that is bundled internally to Open MPI's source code 
tree/expanded tarball).

Unless you have a reason to use the external PMIx, the internal/bundled PMIx is 
usually sufficient.

Ah. I did not know that. I figured if our SLURM was built linked to a specific 
PMIx v2 that I should build Open MPI with the same PMIx. I'll build an Open MPI 
4 without specifying this.


> --with-libevent=/usr

Same as previous; change "pmix" to "libevent" (i.e., use the external libevent 
instead of the bundled libevent).

> CC=icc CXX=icpc FC=ifort

Specify the exact compilers to use.

> The MPI 1 is because I need to build HDF5 eventually and I added psm2 because 
> it's an Omnipath cluster. The libevent was probably a red herring as 
> libevent-devel wasn't installed on the system. It was eventually, and I just 
> didn't remove the flag. And I saw no errors in the build!

Might as well remove the --with-libevent if you don't need it.

> However, I seem to have built an Open MPI that doesn't work:
>
> (1099)(master) $ mpirun --version
> mpirun (Open MPI) 4.0.0
>
> Report bugs to http://www.open-mpi.org/community/help/
> (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
>
> It just sits there...forever. Can the gurus here help me figure out what I 
> managed to break? Perhaps I added too much to my configure line? Not enough?

There could be a few things going on here.

Are you running inside a SLURM job?  E.g., in a "salloc" job, or in an "sbatch" 
script?

I have salloc'd 8 nodes of 40 cores each. Intel MPI 18 and 19 work just fine 
(as you'd hope on an Omnipath cluster), but for some reason Open MPI is twitchy 
on this cluster. I once managed to get Open MPI 3.0.1 working (a few months 
ago), and it had some interesting startup scaling I liked (slow at low core 
count, but getting close to Intel MPI at high core count), though it seemed to 
not work after about 100 nodes (4000 processes) or so.

--
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna 
Rampton

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-18 Thread Matt Thompson
On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> On Jan 18, 2019, at 12:43 PM, Matt Thompson  wrote:
> >
> > With some help, I managed to build an Open MPI 4.0.0 with:
>
> We can discuss each of these params to let you know what they are.
>
> > ./configure --disable-wrapper-rpath --disable-wrapper-runpath
>
> Did you have a reason for disabling these?  They're generally good
> things.  What they do is add linker flags to the wrapper compilers (i.e.,
> mpicc and friends) that basically put a default path to find libraries at
> run time (that can/will in most cases override LD_LIBRARY_PATH -- but you
> can override these linked-in-default-paths if you want/need to).
>

I've had these in my Open MPI builds for a while now. The reason was one of
the libraries I need for the climate model I work on went nuts if both of
them weren't there. It was originally the rpath one but then eventually
(Open MPI 3?) I had to add the runpath one. But I have been updating the
libraries more aggressively recently (due to OS upgrades) so it's possible
this is no longer needed.


>
> > --with-psm2
>
> Ensure that Open MPI can include support for the PSM2 library, and abort
> configure if it cannot.
>
> > --with-slurm
>
> Ensure that Open MPI can include support for SLURM, and abort configure if
> it cannot.
>
> > --enable-mpi1-compatibility
>
> Add support for MPI_Address and other MPI-1 functions that have since been
> deleted from the MPI 3.x specification.
>
> > --with-ucx
>
> Ensure that Open MPI can include support for UCX, and abort configure if
> it cannot.
>
> > --with-pmix=/usr/nlocal/pmix/2.1
>
> Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1
> (instead of using the PMIx that is bundled internally to Open MPI's source
> code tree/expanded tarball).
>
> Unless you have a reason to use the external PMIx, the internal/bundled
> PMIx is usually sufficient.
>

Ah. I did not know that. I figured if our SLURM was built linked to a
specific PMIx v2 that I should build Open MPI with the same PMIx. I'll
build an Open MPI 4 without specifying this.


>
> > --with-libevent=/usr
>
> Same as previous; change "pmix" to "libevent" (i.e., use the external
> libevent instead of the bundled libevent).
>
> > CC=icc CXX=icpc FC=ifort
>
> Specify the exact compilers to use.
>
> > The MPI 1 is because I need to build HDF5 eventually and I added psm2
> because it's an Omnipath cluster. The libevent was probably a red herring
> as libevent-devel wasn't installed on the system. It was eventually, and I
> just didn't remove the flag. And I saw no errors in the build!
>
> Might as well remove the --with-libevent if you don't need it.
>
> > However, I seem to have built an Open MPI that doesn't work:
> >
> > (1099)(master) $ mpirun --version
> > mpirun (Open MPI) 4.0.0
> >
> > Report bugs to http://www.open-mpi.org/community/help/
> > (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
> >
> > It just sits there...forever. Can the gurus here help me figure out what
> I managed to break? Perhaps I added too much to my configure line? Not
> enough?
>
> There could be a few things going on here.
>
> Are you running inside a SLURM job?  E.g., in a "salloc" job, or in an
> "sbatch" script?
>

I have salloc'd 8 nodes of 40 cores each. Intel MPI 18 and 19 work just
fine (as you'd hope on an Omnipath cluster), but for some reason Open MPI
is twitchy on this cluster. I once managed to get Open MPI 3.0.1 working (a
few months ago), and it had some interesting startup scaling I liked (slow
at low core count, but getting close to Intel MPI at high core count),
though it seemed to not work after about 100 nodes (4000 processes) or so.

-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-18 Thread Jeff Squyres (jsquyres) via users
On Jan 18, 2019, at 12:43 PM, Matt Thompson  wrote:
> 
> With some help, I managed to build an Open MPI 4.0.0 with:

We can discuss each of these params to let you know what they are.

> ./configure --disable-wrapper-rpath --disable-wrapper-runpath

Did you have a reason for disabling these?  They're generally good things.  
What they do is add linker flags to the wrapper compilers (i.e., mpicc and 
friends) that basically put a default path to find libraries at run time (that 
can/will in most cases override LD_LIBRARY_PATH -- but you can override these 
linked-in-default-paths if you want/need to).

> --with-psm2

Ensure that Open MPI can include support for the PSM2 library, and abort 
configure if it cannot.

> --with-slurm 

Ensure that Open MPI can include support for SLURM, and abort configure if it 
cannot.

> --enable-mpi1-compatibility

Add support for MPI_Address and other MPI-1 functions that have since been 
deleted from the MPI 3.x specification.

> --with-ucx

Ensure that Open MPI can include support for UCX, and abort configure if it 
cannot.

> --with-pmix=/usr/nlocal/pmix/2.1

Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1 
(instead of using the PMIx that is bundled internally to Open MPI's source code 
tree/expanded tarball).

Unless you have a reason to use the external PMIx, the internal/bundled PMIx is 
usually sufficient.

> --with-libevent=/usr

Same as previous; change "pmix" to "libevent" (i.e., use the external libevent 
instead of the bundled libevent).

> CC=icc CXX=icpc FC=ifort

Specify the exact compilers to use.

> The MPI 1 is because I need to build HDF5 eventually and I added psm2 because 
> it's an Omnipath cluster. The libevent was probably a red herring as 
> libevent-devel wasn't installed on the system. It was eventually, and I just 
> didn't remove the flag. And I saw no errors in the build!

Might as well remove the --with-libevent if you don't need it.

> However, I seem to have built an Open MPI that doesn't work:
> 
> (1099)(master) $ mpirun --version
> mpirun (Open MPI) 4.0.0
> 
> Report bugs to http://www.open-mpi.org/community/help/
> (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
> 
> It just sits there...forever. Can the gurus here help me figure out what I 
> managed to break? Perhaps I added too much to my configure line? Not enough?

There could be a few things going on here.

Are you running inside a SLURM job?  E.g., in a "salloc" job, or in an "sbatch" 
script?

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-18 Thread Matt Thompson
All,

With some help, I managed to build an Open MPI 4.0.0 with:

./configure --disable-wrapper-rpath --disable-wrapper-runpath --with-psm2
--with-slurm --enable-mpi1-compatibility --with-ucx
--with-pmix=/usr/nlocal/pmix/2.1 --with-libevent=/usr CC=icc CXX=icpc
FC=ifort

The MPI 1 is because I need to build HDF5 eventually and I added psm2
because it's an Omnipath cluster. The libevent was probably a red herring
as libevent-devel wasn't installed on the system. It was eventually, and I
just didn't remove the flag. And I saw no errors in the build!

However, I seem to have built an Open MPI that doesn't work:

(1099)(master) $ mpirun --version
mpirun (Open MPI) 4.0.0

Report bugs to http://www.open-mpi.org/community/help/
(1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe

It just sits there...forever. Can the gurus here help me figure out what I
managed to break? Perhaps I added too much to my configure line? Not enough?

Thanks,
Matt

On Thu, Jan 17, 2019 at 11:10 AM Matt Thompson  wrote:

> Dear Open MPI Gurus,
>
> A cluster I use recently updated their SLURM to have support for UCX and
> PMIx. These are names I've seen and heard often at SC BoFs and posters, but
> now is my first time to play with them.
>
> So, my first question is how exactly should I build Open MPI to try these
> features out. I'm guessing I'll need things like "--with-ucx" to test UCX,
> but is anything needed for PMIx?
>
> Second, when it comes to running Open MPI, are there new MCA parameters I
> need to look out for when testing?
>
> Sorry for the generic questions, but I'm more on the user end of the
> cluster than the administrator end, so I tend to get lost in the detailed
> presentations, etc. I see online.
>
> Thanks,
> Matt
> --
> Matt Thompson
>“The fact is, this is about us identifying what we do best and
>finding more ways of doing less of it better” -- Director of Better
> Anna Rampton
>


-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton

Re: [OMPI users] help installing openmpi 3.0 in ubuntu 16.04

2018-03-16 Thread Jeff Squyres (jsquyres)
(Sending this to the users list, not to just the owner of the users list)

It looks like you might have installed Open MPI correctly.

But you have to give some command line options to mpirun to tell it what to do 
-- you're basically getting an error saying "you didn't tell me what to do, so 
I didn't do anything."

You can do "mpirun --help" to see common options, or "mpirun --help --all" to 
see *all* of mpirun's options (there are many).  You can also see the man page 
for mpirun(1).

Generally, if you're not running in a scheduled environment (e.g., you're just 
running on your laptop, or a handful of nodes that are not using SLURM, Torque, 
or some other scheduler), you tell mpirun what nodes to use, what MPI 
application executable to launch, and how many copies to launch (i.e., the size 
of MPI_COMM_WORLD).  For example:

$ mpirun --host node1,node2,node3,node4 -np 96 ./my_mpi_program

Where:

- node1-node4 is the IP-resolvable name of your 4 nodes
- "-np 96" means to launch 96 copies (e.g., 24 per node -- for this example, 
I'm assuming you have 24 cores per node)
- ./my_mpi_program: a compiled MPI application that you build with mpicc or 
mpifort (e.g., a C or Fortran program that calls MPI_Init, MPI_Finalize, 
...etc.).

You might also want to see: https://www.open-mpi.org/faq/?category=running
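
For reference, a minimal sketch of such an MPI program (the file and program
names here are only illustrative):

// mpi_hello.cpp -- build with: mpicxx mpi_hello.cpp -o my_mpi_program
#include <mpi.h>
#include <cstdio>

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);                  // start the MPI runtime

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // this process' rank
  MPI_Comm_size(MPI_COMM_WORLD, &size);    // size of MPI_COMM_WORLD

  std::printf("Hello from rank %d of %d\n", rank, size);

  MPI_Finalize();                          // shut down cleanly
  return 0;
}

Launched with the mpirun line above, it should print one "Hello" line per rank.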



> On Mar 15, 2018, at 8:23 PM, Keshab Bashyal  wrote:
> 
> Dear Sir, 
> I installed openmpi version3.0 in ubuntu 16.04.
> I followed the exact instruction in the pdf file attached here.
> I set up the path as in the pdf.
> After installing I tried to type "mpirun" in the terminal I get the following 
> message:
> 
> --
> mpirun could not find anything to do.
> 
> It is possible that you forgot to specify how many processes to run via the 
> "-np" argument
> 
> 
> Previously, I had installed openmpi (1.6.5) , and when I used to type 
> "mpirun",
> it used to give me bunch of options showing how to use it. 
> Could you help me installing openmpi in a correct way. 
> 
> Thank you.
> 
> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Help debugging invalid read

2018-02-19 Thread Florian Lindner
Ok, I think I have found the problem.

During std::vector::push_back or emplace_back a realloc happens, and thus the memory
locations that I gave to MPI_Isend
become invalid.

My loop now reads:

  std::vector<MPI_EventData> eventSendBuf(eventsSize); // Buffer to hold the MPI_EventData object

  for (int i = 0; i < eventsSize; ++i) {
    MPI_Request req;

    eventSendBuf.at(i).size = 5;

    cout << "Isending event " << i << endl;
    MPI_Isend(&eventSendBuf[i], 1, MPI_EVENTDATA, 0, 0, MPI_COMM_WORLD, &req);
    requests.push_back(req);
  }
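
Pre-sizing the vector is one way to keep the addresses handed to MPI_Isend
stable; reserving the capacity up front works as well, since reserve() rules
out the reallocation that invalidated them. A sketch along the same lines
(same MPI_EVENTDATA type and requests vector as above):

std::vector<MPI_EventData> eventSendBuf;
eventSendBuf.reserve(eventsSize);   // capacity fixed up front: no realloc, pointers stay valid

for (int i = 0; i < eventsSize; ++i) {
  MPI_Request req;
  MPI_EventData eventdata;
  eventdata.size = 5;
  eventSendBuf.push_back(eventdata);

  MPI_Isend(&eventSendBuf.back(), 1, MPI_EVENTDATA, 0, 0, MPI_COMM_WORLD, &req);
  requests.push_back(req);
}
// the buffer must stay alive and untouched until MPI_Waitall() completes these sends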

Best,
Florian


Am 19.02.2018 um 10:14 schrieb Florian Lindner:
> Hello,
> 
> I am having problems understanding an error valgrind gives me. I tried to bog 
> down the program as much as possible. The
> original program as well as the test example both work fine, but when I link 
> the created library to another application
> I get segfaults. I think that this piece of code is to blame. I run valgrind 
> on it and get an invalid read.
> 
> The code can be seen at 
> https://gist.github.com/floli/d62d16ce7cabb4522e2ae7e6b3cfda43 or below.
> 
> It's about 60 lines of C/C++ code.
> 
> I have also attached the valgrind report below the code.
> 
> The code registers a custom MPI datatype and sends that using an isend. It 
> does not crash or produces invalid data, but
> I fear that the invalid read message from valgrind is a hint of an existing 
> memory corruption.
> 
> But I got no idea where that could happen.
> 
> OpenMPI 3.0.0 @ Arch
> 
> I am very thankful of any hints whatsover!
> 
> Florian
> 
> 
> 
> 
> 
> // Compile and test with: mpicxx -std=c++11 -g -O0 mpitest.cpp  &&
> LD_PRELOAD=/usr/lib/valgrind/libmpiwrap-amd64-linux.so mpirun -n 1 valgrind 
> --read-var-info=yes --leak-check=full ./a.out
> 
> #include <cstddef>
> #include <iostream>
> #include <vector>
> 
> #include <mpi.h>
> 
> using namespace std;
> 
> struct MPI_EventData
> {
>   int size;
> };
> 
> 
> void collect()
> {
>   // Register MPI datatype
>   MPI_Datatype MPI_EVENTDATA;
>   int blocklengths[] = {1};
>   MPI_Aint displacements[] = {offsetof(MPI_EventData, size) };
>   MPI_Datatype types[] = {MPI_INT};
>   MPI_Type_create_struct(1, blocklengths, displacements, types, &MPI_EVENTDATA);
>   MPI_Type_commit(&MPI_EVENTDATA);
> 
>   int rank, MPIsize;
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>   MPI_Comm_size(MPI_COMM_WORLD, &MPIsize);
> 
>   std::vector<MPI_Request> requests;
>   std::vector<int> eventsPerRank(MPIsize);
>   size_t eventsSize = 3; // each rank sends three events, invalid read happens only if eventsSize > 1
>   MPI_Gather(&eventsSize, 1, MPI_INT, eventsPerRank.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);
> 
>   std::vector<MPI_EventData> eventSendBuf; // Buffer to hold the MPI_EventData object
> 
>   for (int i = 0; i < eventsSize; ++i) {
>     MPI_EventData eventdata;
>     MPI_Request req;
> 
>     eventdata.size = 5;
>     eventSendBuf.push_back(eventdata);
> 
>     cout << "Isending event " << i << endl;
>     MPI_Isend(&eventSendBuf.back(), 1, MPI_EVENTDATA, 0, 0, MPI_COMM_WORLD, &req);
>     requests.push_back(req);
>   }
> 
>   if (rank == 0) {
>     for (int i = 0; i < MPIsize; ++i) {
>       for (int j = 0; j < eventsPerRank[i]; ++j) {
>         MPI_EventData ev;
>         MPI_Recv(&ev, 1, MPI_EVENTDATA, i, MPI_ANY_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
> 
>         cout << "Received Size = " << ev.size << endl;
>       }
>     }
>   }
>   MPI_Waitall(requests.size(), requests.data(), MPI_STATUSES_IGNORE);
>   MPI_Type_free(&MPI_EVENTDATA);
> }
> 
> 
> int main(int argc, char *argv[])
> {
>   MPI_Init(&argc, &argv);
> 
>   collect();
> 
>   MPI_Finalize();
> }
> 
> 
> /*
> 
>  % mpicxx -std=c++11 -g -O0 mpitest.cpp  && 
> LD_PRELOAD=/usr/lib/valgrind/libmpiwrap-amd64-linux.so mpirun -n 1 valgrind
> --read-var-info=yes --leak-check=full ./a.out
> ==13584== Memcheck, a memory error detector
> ==13584== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
> ==13584== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
> ==13584== Command: ./a.out
> ==13584==
> valgrind MPI wrappers 13584: Active for pid 13584
> valgrind MPI wrappers 13584: Try MPIWRAP_DEBUG=help for possible options
> ==13584== Thread 3:
> ==13584== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
> ==13584==at 0x61A0FE6: epoll_pwait (in /usr/lib/libc-2.26.so)
> ==13584==by 0x677CDDC: ??? (in /usr/lib/openmpi/libopen-pal.so.40.0.0)
> ==13584==by 0x6780EDA: opal_libevent2022_event_base_loop (in 
> /usr/lib/openmpi/libopen-pal.so.40.0.0)
> ==13584==by 0x93100CE: ??? (in 
> /usr/lib/openmpi/openmpi/mca_pmix_pmix2x.so)
> ==13584==by 0x5E9408B: start_thread (in /usr/lib/libpthread-2.26.so)
> ==13584==by 0x61A0E7E: clone (in /usr/lib/libc-2.26.so)
> ==13584==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
> ==13584==
> Isending event 0
> ==13584== Thread 1:
> ==13584== Invalid read of size 2
> ==13584==at 0x4C33B20: memmove (vg_replace_strmem.c:1258)
> ==13584==by 0x11A7BB: MPI_EventData* std::__copy_move std::random_access_iterator_tag>::__copy_m(MPI_EventData 
> const*, 

Re: [OMPI users] Help with binding processes correctly in Hybrid code (Openmpi +openmp)

2017-11-14 Thread Gilles Gouaillardet
Hi,

per 
https://www2.cisl.ucar.edu/resources/computational-systems/cheyenne/running-jobs/pbs-pro-job-script-examples,
you can try

#PBS -l select=2:ncpus=16:mpiprocs=2:ompthreads=8

Cheers,

Gilles
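
Once the job starts, a small hybrid probe makes it easy to see where the
ranks and threads actually land (a sketch, assuming a Linux/glibc
sched_getcpu(); build with mpicxx -fopenmp):

#include <mpi.h>
#include <omp.h>
#include <sched.h>    // sched_getcpu(), Linux/glibc only
#include <cstdio>

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  char host[MPI_MAX_PROCESSOR_NAME];
  int len;
  MPI_Get_processor_name(host, &len);

  #pragma omp parallel
  {
    // every thread reports the core it is currently running on
    std::printf("%s: rank %d, thread %d/%d, cpu %d\n",
                host, rank, omp_get_thread_num(),
                omp_get_num_threads(), sched_getcpu());
  }

  MPI_Finalize();
  return 0;
}

With 2 ranks and 16 threads each (or 4 ranks and 8 threads), the output shows
at a glance whether both nodes are used and whether threads pile up on the
same cores.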


On Tue, Nov 14, 2017 at 4:32 PM, Anil K. Dasanna
 wrote:
> Hello all,
>
> I am relatively new to mpi computing. I am doing particle simulations.
> So far, I have only used pure MPI and never had a problem. But for my system,
> it's best to use hybrid programming.
> But I always fail to correctly bind all processes and receive binding errors
> from the cluster.
> Could one of you please clarify correct parameters for mpirun for below two
> cases:
>
> 1) I would like to use, let's say, two MPI tasks and 16 processors as OpenMP
> threads.
> I request nodes:2:ppn=16.
> 2) For the same case, how should I give parameters such that I have 4 MPI
> tasks and
> 8 OpenMP threads?
>
> I also tried options with --map-by, and it happened that sometimes both MPI
> tasks were placed
> on the same node and the other node was idle. I really appreciate your help.
> And my Open MPI version is 1.8.
>
>
> --
> Kind Regards,
> Anil.
> *
> "Its impossible" - said Pride
> "Its risky" - said Experience
> "Its pointless" - said Reason
> "Give it a try" - whispered the heart
> *
>


Re: [OMPI users] Help

2017-04-27 Thread Gus Correa

On 04/27/2017 06:21 AM, Corina Jeni Tudorache wrote:

Hello,



I am trying to install Open MPI on CentOS and I got stuck. I have
installed a GNU compiler and after that I ran the command: yum install
openmpi-devel.x86_64. But when I run command mpi selector –- list I
receive this error “mpi: command not found”

I am following the instruction from here:
https://na-inet.jp/na/pccluster/centos_x86_64-en.html

Any help is much appreciated. :-)

Corina


You need to install openmpi.x86_64 also, not only openmpi-devel.x86_64.
That is the minimum.

I hope this helps,
Gus Correa







Re: [OMPI users] Help

2017-04-27 Thread gilles
 Or you can replace the mpi-selector thing with

module load mpi/openmpi-x86_64

if it does not work,

module avail

and then

module load 

note this is per session, so you should do that each time you start a 
new terminal or submit a job

Cheers,

Gilles

- Original Message -

When I run command rpm --query centos-release, it shows the 
following: centos-release-7-3.1611.el7.centos.x86_64. So maybe I should 
install CentOS 5?

 

C.

 

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of 
gil...@rist.or.jp
Sent: Thursday, April 27, 2017 12:36 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Help

 

 by the way, are you running CentOS 5 ?

it seems mpi-selector is no more available from CentOS 6

 

Cheers,

 

Gilles

- Original Message -

Yes, I write it wrong the previous e-mail, but actually it does 
not work. Gives the error message: mpi: command not found

 

 

Corina

 

 

 

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf 
Of gil...@rist.or.jp
Sent: Thursday, April 27, 2017 11:34 AM
To: Open MPI Users <users@lists.open-mpi.org>
    Subject: Re: [OMPI users] Help

 

 

 Hi,

 

that looks like a typo, the command is

mpi-selector --list

 

Cheers,

 

Gilles

- Original Message -

Hello,

 

 

I am trying to install Open MPI on Centos and I got stuck. I 
have installed an GNU compiler and after that I run the command: yum 
install openmpi-devel.x86_64. But when I run command mpi selector –- 
list I receive this error “mpi: command not found”

I am following the instruction from here: 
https://na-inet.jp/na/pccluster/centos_x86_64-en.html

Any help is much appreciated. J

 

 

Corina




Re: [OMPI users] Help

2017-04-27 Thread Corina Jeni Tudorache
When I run command rpm --query centos-release, it shows the following: 
centos-release-7-3.1611.el7.centos.x86_64. So maybe I should install CentOS 5?
 
C.
 
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of 
gil...@rist.or.jp
Sent: Thursday, April 27, 2017 12:36 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Help
 
 by the way, are you running CentOS 5 ?
it seems mpi-selector is no more available from CentOS 6
 
Cheers,
 
Gilles
- Original Message -
Yes, I write it wrong the previous e-mail, but actually it does not work. Gives 
the error message: mpi: command not found
 
 
Corina
 
 
 
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of 
gil...@rist.or.jp
Sent: Thursday, April 27, 2017 11:34 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Help
 
 
 Hi,
 
that looks like a typo, the command is
mpi-selector --list
 
Cheers,
 
Gilles
- Original Message -
Hello,
 
 
I am trying to install Open MPI on Centos and I got stuck. I have installed an 
GNU compiler and after that I run the command: yum install 
openmpi-devel.x86_64. But when I run command mpi selector –- list I receive 
this error “mpi: command not found”
I am following the instruction from here: 
https://na-inet.jp/na/pccluster/centos_x86_64-en.html
Any help is much appreciated. J
 
 
Corina

Re: [OMPI users] Help

2017-04-27 Thread Corina Jeni Tudorache
It says mpi-selector is not installed. And yes for the mpi-selector command, 
the error message is command not found.
 
C.
 
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of 
gil...@rist.or.jp
Sent: Thursday, April 27, 2017 12:32 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Help
 
 Well, i cannot make sense of this error message.
 
if the command is mpi-selector, the error message could be
mpi-selector: command not found
but this is not the error message you reported
 
what does
rpm -ql mpi-selector
reports ?
 
Cheers,
 
Gilles
- Original Message -
Yes, I write it wrong the previous e-mail, but actually it does not work. Gives 
the error message: mpi: command not found
 
 
Corina
 
 
 
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of 
gil...@rist.or.jp
Sent: Thursday, April 27, 2017 11:34 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Help
 
 
 Hi,
 
that looks like a typo, the command is
mpi-selector --list
 
Cheers,
 
Gilles
- Original Message -
Hello,
 
 
I am trying to install Open MPI on Centos and I got stuck. I have installed an 
GNU compiler and after that I run the command: yum install 
openmpi-devel.x86_64. But when I run command mpi selector –- list I receive 
this error “mpi: command not found”
I am following the instruction from here: 
https://na-inet.jp/na/pccluster/centos_x86_64-en.html
Any help is much appreciated. J
 
 
Corina

Re: [OMPI users] Help

2017-04-27 Thread gilles
 by the way, are you running CentOS 5 ?

it seems mpi-selector is no longer available from CentOS 6 on

Cheers,

Gilles

- Original Message -

Yes, I write it wrong the previous e-mail, but actually it does not 
work. Gives the error message: mpi: command not found

 

Corina

 

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of 
gil...@rist.or.jp
Sent: Thursday, April 27, 2017 11:34 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Help

 

 Hi,

 

that looks like a typo, the command is

mpi-selector --list

 

Cheers,

 

Gilles

- Original Message -

Hello,

 

 

I am trying to install Open MPI on Centos and I got stuck. I 
have installed an GNU compiler and after that I run the command: yum 
install openmpi-devel.x86_64. But when I run command mpi selector –- 
list I receive this error “mpi: command not found”

I am following the instruction from here: 
https://na-inet.jp/na/pccluster/centos_x86_64-en.html


Any help is much appreciated. J

 

 

Corina




Re: [OMPI users] Help

2017-04-27 Thread gilles
 Well, I cannot make sense of this error message.

if the command is mpi-selector, the error message could be

mpi-selector: command not found

but this is not the error message you reported

what does

rpm -ql mpi-selector

reports ?

Cheers,

Gilles

- Original Message -

Yes, I write it wrong the previous e-mail, but actually it does not 
work. Gives the error message: mpi: command not found

 

Corina

 

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of 
gil...@rist.or.jp
Sent: Thursday, April 27, 2017 11:34 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Help

 

 Hi,

 

that looks like a typo, the command is

mpi-selector --list

 

Cheers,

 

Gilles

- Original Message -

Hello,

 

 

I am trying to install Open MPI on Centos and I got stuck. I 
have installed an GNU compiler and after that I run the command: yum 
install openmpi-devel.x86_64. But when I run command mpi selector –- 
list I receive this error “mpi: command not found”

I am following the instruction from here: 
https://na-inet.jp/na/pccluster/centos_x86_64-en.html


Any help is much appreciated. J

 

 

Corina




Re: [OMPI users] Help

2017-04-27 Thread Corina Jeni Tudorache
Yes, I wrote it wrong in the previous e-mail, but it actually does not work. It gives 
the error message: mpi: command not found
 
Corina
 
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of 
gil...@rist.or.jp
Sent: Thursday, April 27, 2017 11:34 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Help
 
 Hi,
 
that looks like a typo, the command is
mpi-selector --list
 
Cheers,
 
Gilles
- Original Message -
Hello,
 
 
I am trying to install Open MPI on Centos and I got stuck. I have installed an 
GNU compiler and after that I run the command: yum install 
openmpi-devel.x86_64. But when I run command mpi selector –- list I receive 
this error “mpi: command not found”
I am following the instruction from here: 
https://na-inet.jp/na/pccluster/centos_x86_64-en.html
Any help is much appreciated. J
 
 
Corina

Re: [OMPI users] Help

2017-04-27 Thread gilles
 Hi,

that looks like a typo, the command is

mpi-selector --list

Cheers,

Gilles

- Original Message -

Hello,

 

I am trying to install Open MPI on Centos and I got stuck. I have 
installed an GNU compiler and after that I run the command: yum install 
openmpi-devel.x86_64. But when I run command mpi selector –- list I 
receive this error “mpi: command not found”

I am following the instruction from here: 
https://na-inet.jp/na/pccluster/centos_x86_64-en.html


Any help is much appreciated. J

 

Corina




Re: [OMPI users] Help with Open MPI 2.1.0 and PGI 16.10: Configure and C++

2017-03-24 Thread Matt Thompson
Gilles,

The library I am having issues linking is ESMF, and it is a C++/Fortran
application. From
http://www.earthsystemmodeling.org/esmf_releases/non_public/ESMF_7_0_0/ESMF_usrdoc/node9.html#SECTION00092000
:

The following compilers and utilities *are required* for compiling, linking
> and testing the ESMF software:
> Fortran90 (or later) compiler;
> C++ compiler;
> MPI implementation compatible with the above compilers (but see below);
> GNU's gcc compiler - for a standard cpp preprocessor implementation;
> GNU make;
> Perl - for running test scripts.


(Emphasis mine)

This is why I am concerned. For now, I'll build Open MPI with the (possibly
useless) C++ support for PGI and move on to the Fortran issue (which I'll
detail in another email).

But, as I *need* ESMF for my application, it would be good to get an mpicxx
that I can have confidence in with PGI.

Matt


On Thu, Mar 23, 2017 at 9:05 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Matt,
>
> a C++ compiler is required to configure Open MPI.
> That being said, C++ compiler is only used if you build the C++ bindings
> (That were removed from MPI-3)
> And unless you plan to use the mpic++ wrapper (with or without the C++
> bindings),
> a valid C++ compiler is not required at all.
> /* configure still requires one, and that could be improved */
>
> My point is you should not worry too much about configure messages related
> to C++,
> and you should instead focus on the Fortran issue.
>
> Cheers,
>
> Gilles
>
> On Thursday, March 23, 2017, Matt Thompson  wrote:
>
>> All, I'm hoping one of you knows what I might be doing wrong here.  I'm
>> trying to use Open MPI 2.1.0 for PGI 16.10 (Community Edition) on macOS.
>> Now, I built it a la:
>>
>> http://www.pgroup.com/userforum/viewtopic.php?p=21105#21105
>>
>> and found that it built, but the resulting mpifort, etc were just not
>> good. Couldn't even do Hello World.
>>
>> So, I thought I'd start from the beginning. I tried running:
>>
>> configure --disable-wrapper-rpath CC=pgcc CXX=pgc++ FC=pgfortran
>> --prefix=/Users/mathomp4/installed/Compiler/pgi-16.10/openmpi/2.1.0
>> but when I did I saw this:
>>
>> *** C++ compiler and preprocessor
>> checking whether we are using the GNU C++ compiler... yes
>> checking whether pgc++ accepts -g... yes
>> checking dependency style of pgc++... none
>> checking how to run the C++ preprocessor... pgc++ -E
>> checking for the C++ compiler vendor... gnu
>>
>> Well, that's not the right vendor. So, I took a look at configure and I
>> saw that at least some detection for PGI was a la:
>>
>>   pgCC* | pgcpp*)
>> # Portland Group C++ compiler
>> case `$CC -V` in
>> *pgCC\ [1-5].* | *pgcpp\ [1-5].*)
>>
>>   pgCC* | pgcpp*)
>> # Portland Group C++ compiler
>> lt_prog_compiler_wl_CXX='-Wl,'
>> lt_prog_compiler_pic_CXX='-fpic'
>> lt_prog_compiler_static_CXX='-Bstatic'
>> ;;
>>
>> Ah. PGI 16.9+ now use pgc++ to do C++ compiling, not pgcpp. So, I hacked
>> configure so that references to pgCC (nonexistent on macOS) are gone and
>> all pgcpp became pgc++, but:
>>
>> *** C++ compiler and preprocessor
>> checking whether we are using the GNU C++ compiler... yes
>> checking whether pgc++ accepts -g... yes
>> checking dependency style of pgc++... none
>> checking how to run the C++ preprocessor... pgc++ -E
>> checking for the C++ compiler vendor... gnu
>>
>> Well, at this point, I think I'm stopping until I get help. Will this
>> chunk of configure always return gnu for PGI? I know the C part returns
>> 'portland group':
>>
>> *** C compiler and preprocessor
>> checking for gcc... (cached) pgcc
>> checking whether we are using the GNU C compiler... (cached) no
>> checking whether pgcc accepts -g... (cached) yes
>> checking for pgcc option to accept ISO C89... (cached) none needed
>> checking whether pgcc understands -c and -o together... (cached) yes
>> checking for pgcc option to accept ISO C99... none needed
>> checking for the C compiler vendor... portland group
>>
>> so I thought the C++ section would as well. I also tried passing in
>> --enable-mpi-cxx, but that did nothing.
>>
>> Is this just a red herring? My real concern is with pgfortran/mpifort,
>> but I thought I'd start with this. If this is okay, I'll move on and detail
>> the fortran issues I'm having.
>>
>> Matt
>> --
>> Matt Thompson
>>
>> Man Among Men
>> Fulcrum of History
>>
>>



-- 
Matt Thompson

Man Among Men
Fulcrum of History

Re: [OMPI users] Help with Open MPI 2.1.0 and PGI 16.10: Configure and C++

2017-03-23 Thread Gilles Gouaillardet
Matt,

a C++ compiler is required to configure Open MPI.
That being said, the C++ compiler is only used if you build the C++ bindings
(which were removed in MPI-3).
And unless you plan to use the mpic++ wrapper (with or without the C++
bindings),
a valid C++ compiler is not required at all.
/* configure still requires one, and that could be improved */

My point is you should not worry too much about configure messages related
to C++,
and you should instead focus on the Fortran issue.

Cheers,

Gilles

On Thursday, March 23, 2017, Matt Thompson  wrote:

> All, I'm hoping one of you knows what I might be doing wrong here.  I'm
> trying to use Open MPI 2.1.0 for PGI 16.10 (Community Edition) on macOS.
> Now, I built it a la:
>
> http://www.pgroup.com/userforum/viewtopic.php?p=21105#21105
>
> and found that it built, but the resulting mpifort, etc were just not
> good. Couldn't even do Hello World.
>
> So, I thought I'd start from the beginning. I tried running:
>
> configure --disable-wrapper-rpath CC=pgcc CXX=pgc++ FC=pgfortran
> --prefix=/Users/mathomp4/installed/Compiler/pgi-16.10/openmpi/2.1.0
> but when I did I saw this:
>
> *** C++ compiler and preprocessor
> checking whether we are using the GNU C++ compiler... yes
> checking whether pgc++ accepts -g... yes
> checking dependency style of pgc++... none
> checking how to run the C++ preprocessor... pgc++ -E
> checking for the C++ compiler vendor... gnu
>
> Well, that's not the right vendor. So, I took a look at configure and I
> saw that at least some detection for PGI was a la:
>
>   pgCC* | pgcpp*)
> # Portland Group C++ compiler
> case `$CC -V` in
> *pgCC\ [1-5].* | *pgcpp\ [1-5].*)
>
>   pgCC* | pgcpp*)
> # Portland Group C++ compiler
> lt_prog_compiler_wl_CXX='-Wl,'
> lt_prog_compiler_pic_CXX='-fpic'
> lt_prog_compiler_static_CXX='-Bstatic'
> ;;
>
> Ah. PGI 16.9+ now use pgc++ to do C++ compiling, not pgcpp. So, I hacked
> configure so that references to pgCC (nonexistent on macOS) are gone and
> all pgcpp became pgc++, but:
>
> *** C++ compiler and preprocessor
> checking whether we are using the GNU C++ compiler... yes
> checking whether pgc++ accepts -g... yes
> checking dependency style of pgc++... none
> checking how to run the C++ preprocessor... pgc++ -E
> checking for the C++ compiler vendor... gnu
>
> Well, at this point, I think I'm stopping until I get help. Will this
> chunk of configure always return gnu for PGI? I know the C part returns
> 'portland group':
>
> *** C compiler and preprocessor
> checking for gcc... (cached) pgcc
> checking whether we are using the GNU C compiler... (cached) no
> checking whether pgcc accepts -g... (cached) yes
> checking for pgcc option to accept ISO C89... (cached) none needed
> checking whether pgcc understands -c and -o together... (cached) yes
> checking for pgcc option to accept ISO C99... none needed
> checking for the C compiler vendor... portland group
>
> so I thought the C++ section would as well. I also tried passing in
> --enable-mpi-cxx, but that did nothing.
>
> Is this just a red herring? My real concern is with pgfortran/mpifort, but
> I thought I'd start with this. If this is okay, I'll move on and detail the
> fortran issues I'm having.
>
> Matt
> --
> Matt Thompson
>
> Man Among Men
> Fulcrum of History
>
>

Re: [OMPI users] Help with Open MPI 2.1.0 and PGI 16.10: Configure and C++

2017-03-23 Thread Reuti
Hi,

Am 22.03.2017 um 20:12 schrieb Matt Thompson:

> […]
> 
> Ah. PGI 16.9+ now use pgc++ to do C++ compiling, not pgcpp. So, I hacked 
> configure so that references to pgCC (nonexistent on macOS) are gone and all 
> pgcpp became pgc++, but:

This is not unique to macOS. pgCC used the STLPort STL and is no longer included 
with their compiler suite; pgc++ now uses a GCC-compatible library format and 
replaces the former one on Linux too.

There I get, ignoring the gnu output during `configure` and compiling anyway:

$ mpic++ --version

pgc++ 16.10-0 64-bit target on x86-64 Linux -tp bulldozer
The Portland Group - PGI Compilers and Tools
Copyright (c) 2016, NVIDIA CORPORATION.  All rights reserved.

Maybe some options for the `mpic++` wrapper were just set in a wrong way?
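
A quick way to check which front end the wrapper really invokes is to look at
the predefined macros of a tiny probe (a sketch; pgc++ typically also defines
the GNU compatibility macros, which would explain why configure's vendor check
answers "gnu"):

// probe.cpp -- compile with: mpic++ probe.cpp -o probe && ./probe
#include <cstdio>

int main()
{
#if defined(__PGI)
  std::printf("PGI front end detected (__PGI is defined)\n");
#endif
#if defined(__GNUC__)
  // pgc++ advertises GNU compatibility, so this can be defined as well
  std::printf("GNU compatibility macros present (__GNUC__ = %d)\n", __GNUC__);
#endif
  return 0;
}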

===

Nevertheless: did you see the error on the Mac at the end of the `configure` 
step too, or was it gone after the hints on the discussion's link you posted? 
As I see it, there is still one about "libevent".

-- Reuti


> 
> *** C++ compiler and preprocessor
> checking whether we are using the GNU C++ compiler... yes
> checking whether pgc++ accepts -g... yes
> checking dependency style of pgc++... none
> checking how to run the C++ preprocessor... pgc++ -E
> checking for the C++ compiler vendor... gnu
> 
> Well, at this point, I think I'm stopping until I get help. Will this chunk 
> of configure always return gnu for PGI? I know the C part returns 'portland 
> group':
> 
> *** C compiler and preprocessor
> checking for gcc... (cached) pgcc
> checking whether we are using the GNU C compiler... (cached) no
> checking whether pgcc accepts -g... (cached) yes
> checking for pgcc option to accept ISO C89... (cached) none needed
> checking whether pgcc understands -c and -o together... (cached) yes
> checking for pgcc option to accept ISO C99... none needed
> checking for the C compiler vendor... portland group
> 
> so I thought the C++ section would as well. I also tried passing in 
> --enable-mpi-cxx, but that did nothing.
> 
> Is this just a red herring? My real concern is with pgfortran/mpifort, but I 
> thought I'd start with this. If this is okay, I'll move on and detail the 
> fortran issues I'm having.
> 
> Matt
> --
> Matt Thompson
> Man Among Men
> Fulcrum of History




Re: [OMPI users] Help - Client / server - app hangs in connect/accept by the second or next client that wants to connect to server

2016-10-21 Thread Gilles Gouaillardet
Matus,


This has very likely been fixed by
https://github.com/open-mpi/ompi/pull/2259
Can you download the patch at
https://github.com/open-mpi/ompi/pull/2259.patch and apply it manually on
v1.10 ?

Cheers,

Gilles


On Monday, August 29, 2016, M. D.  wrote:

>
> Hi,
>
> I would like to ask - are there any new solutions or investigations in
> this problem?
>
> Cheers,
>
> Matus Dobrotka
>
> 2016-07-19 15:23 GMT+02:00 Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com
> >:
>
>> my bad for the confusion,
>>
>> I misread you and miswrote my reply.
>>
>> I will investigate this again.
>>
>> strictly speaking, the clients can only start after the server first
>> write the port info to a file.
>> if you start the client right after the server start, they might use
>> incorrect/outdated info and cause all the test hang.
>>
>> I will start reproducing the hang
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On Tuesday, July 19, 2016, M. D. > > wrote:
>>
>>> Yes I understand it, but I think, this is exactly that situation you are
>>> talking about. In my opinion, the test is doing exactly what you said -
>>> when a new player is willing to join, other players must invoke 
>>> MPI_Comm_accept().
>>> All *other* players must invoke MPI_Comm_accept(). Only the last client
>>> (in this case last player which wants to join) does not
>>> invoke MPI_Comm_accept(), because this client invokes only
>>> MPI_Comm_connect(). He is connecting to communicator, in which all other
>>> players are already involved and therefore this last client doesn't have to
>>> invoke MPI_Comm_accept().
>>>
>>> Am I still missing something in this my reflection?
>>>
>>> Matus
>>>
>>> 2016-07-19 10:55 GMT+02:00 Gilles Gouaillardet :
>>>
 here is what the client is doing

 printf("CLIENT: after merging, new comm: size=%d rank=%d\n", size,
 rank) ;

 for (i = rank ; i < num_clients ; i++)
 {
   /* client performs a collective accept */
   CHK(MPI_Comm_accept(server_port_name, MPI_INFO_NULL, 0, intracomm, &intercomm)) ;

   printf("CLIENT: connected to server on port\n") ;
   [...]

 }

 2) has rank 1

 /* and 3) has rank 2) */

 so unless you run 2) with num_clients=2, MPI_Comm_accept() is never
 called, hence my analysis of the crash/hang


 I understand what you are trying to achieve, keep in mind
 MPI_Comm_accept() is a collective call, so when a new player

 is willing to join, other players must invoke MPI_Comm_accept().

 and it is up to you to make sure that happens


 Cheers,


 Gilles

 On 7/19/2016 5:48 PM, M. D. wrote:



 2016-07-19 10:06 GMT+02:00 Gilles Gouaillardet :

> MPI_Comm_accept must be called by all the tasks of the local
> communicator.
>
 Yes, that's how I understand it. In the source code of the test, all
 the tasks call  MPI_Comm_accept - server and also relevant clients.

> so if you
>
> 1) mpirun -np 1 ./singleton_client_server 2 1
>
> 2) mpirun -np 1 ./singleton_client_server 2 0
>
> 3) mpirun -np 1 ./singleton_client_server 2 0
>
> then 3) starts after 2) has exited, so on 1), intracomm is made of 1)
> and an exited task (2)
>
 This is not true in my opinion -  because of above mentioned fact that
 MPI_Comm_accept is called by all the tasks of the local communicator.

> /*
>
> strictly speaking, there is a race condition, if 2) has exited, then
> MPI_Comm_accept will crash when 1) informs 2) that 3) has joined.
>
> if 2) has not yet exited, then the test will hang because 2) does not
> invoke MPI_Comm_accept
>
> */
>
 Task 2) does not exit, because of blocking call of MPI_Comm_accept.

>
>

> there are different ways of seeing things :
>
> 1) this is an incorrect usage of the test, the number of clients
> should be the same everywhere
>
> 2) task 2) should not exit (because it did not call
> MPI_Comm_disconnect()) and the test should hang when
>
> starting task 3) because task 2) does not call MPI_Comm_accept()
>
>
> ad 1) I am sorry, but maybe I do not understand what you think - In my
 previous post I wrote that the number of clients is the same in every
 mpirun instance.
 ad 2) it is the same as above

> i do not know how you want to spawn your tasks.
>
> if 2) and 3) do not need to communicate with each other (they only
> communicate with 1)), then
>
> you can simply MPI_Comm_accept(MPI_COMM_WORLD) in 1)
>
> if 2 and 3) need to communicate with each other, it would be much
> easier to 

Re: [OMPI users] Help - Client / server - app hangs in connect/accept by the second or next client that wants to connect to server

2016-07-19 Thread Gilles Gouaillardet
my bad for the confusion,

I misread you and miswrote my reply.

I will investigate this again.

strictly speaking, the clients can only start after the server first writes
the port info to a file.
if you start the clients right after the server starts, they might use
incorrect/outdated info and cause the whole test to hang.

I will start reproducing the hang

Cheers,

Gilles
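
For reference, the hand-off described above has roughly this shape on the
server side (a sketch, not the test's exact code; the file name is
illustrative):

#include <mpi.h>
#include <fstream>

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);

  char port_name[MPI_MAX_PORT_NAME];
  MPI_Open_port(MPI_INFO_NULL, port_name);   // runtime-generated port string

  {   // publish the port; clients must only read the file once it is fully written
    std::ofstream f("server_port.txt");
    f << port_name << "\n";
  }

  MPI_Comm intercomm;
  MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);

  // ... talk to the client over intercomm ...

  MPI_Comm_disconnect(&intercomm);
  MPI_Close_port(port_name);
  MPI_Finalize();
  return 0;
}

A client reads the same file into a string and passes it to MPI_Comm_connect();
if it reads before the server has finished writing, it connects with stale or
empty data and the run hangs, which is exactly the race mentioned above.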

On Tuesday, July 19, 2016, M. D.  wrote:

> Yes I understand it, but I think, this is exactly that situation you are
> talking about. In my opinion, the test is doing exactly what you said -
> when a new player is willing to join, other players must invoke 
> MPI_Comm_accept().
> All *other* players must invoke MPI_Comm_accept(). Only the last client
> (in this case last player which wants to join) does not
> invoke MPI_Comm_accept(), because this client invokes only
> MPI_Comm_connect(). He is connecting to communicator, in which all other
> players are already involved and therefore this last client doesn't have to
> invoke MPI_Comm_accept().
>
> Am I still missing something in this my reflection?
>
> Matus
>
> 2016-07-19 10:55 GMT+02:00 Gilles Gouaillardet  >:
>
>> here is what the client is doing
>>
>> printf("CLIENT: after merging, new comm: size=%d rank=%d\n", size,
>> rank) ;
>>
>> for (i = rank ; i < num_clients ; i++)
>> {
>>   /* client performs a collective accept */
>>   CHK(MPI_Comm_accept(server_port_name, MPI_INFO_NULL, 0, intracomm, &intercomm)) ;
>>
>>   printf("CLIENT: connected to server on port\n") ;
>>   [...]
>>
>> }
>>
>> 2) has rank 1
>>
>> /* and 3) has rank 2) */
>>
>> so unless you run 2) with num_clients=2, MPI_Comm_accept() is never
>> called, hence my analysis of the crash/hang
>>
>>
>> I understand what you are trying to achieve, keep in mind
>> MPI_Comm_accept() is a collective call, so when a new player
>>
>> is willing to join, other players must invoke MPI_Comm_accept().
>>
>> and it is up to you to make sure that happens
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>> On 7/19/2016 5:48 PM, M. D. wrote:
>>
>>
>>
>> 2016-07-19 10:06 GMT+02:00 Gilles Gouaillardet > >:
>>
>>> MPI_Comm_accept must be called by all the tasks of the local
>>> communicator.
>>>
>> Yes, that's how I understand it. In the source code of the test, all the
>> tasks call  MPI_Comm_accept - server and also relevant clients.
>>
>>> so if you
>>>
>>> 1) mpirun -np 1 ./singleton_client_server 2 1
>>>
>>> 2) mpirun -np 1 ./singleton_client_server 2 0
>>>
>>> 3) mpirun -np 1 ./singleton_client_server 2 0
>>>
>>> then 3) starts after 2) has exited, so on 1), intracomm is made of 1)
>>> and an exited task (2)
>>>
>> This is not true in my opinion -  because of above mentioned fact that
>> MPI_Comm_accept is called by all the tasks of the local communicator.
>>
>>> /*
>>>
>>> strictly speaking, there is a race condition, if 2) has exited, then
>>> MPI_Comm_accept will crash when 1) informs 2) that 3) has joined.
>>>
>>> if 2) has not yet exited, then the test will hang because 2) does not
>>> invoke MPI_Comm_accept
>>>
>>> */
>>>
>> Task 2) does not exit, because of blocking call of MPI_Comm_accept.
>>
>>>
>>>
>>
>>> there are different ways of seeing things :
>>>
>>> 1) this is an incorrect usage of the test, the number of clients should
>>> be the same everywhere
>>>
>>> 2) task 2) should not exit (because it did not call
>>> MPI_Comm_disconnect()) and the test should hang when
>>>
>>> starting task 3) because task 2) does not call MPI_Comm_accept()
>>>
>>>
>>> ad 1) I am sorry, but maybe I do not understand what you think - In my
>> previous post I wrote that the number of clients is the same in every
>> mpirun instance.
>> ad 2) it is the same as above
>>
>>> i do not know how you want to spawn your tasks.
>>>
>>> if 2) and 3) do not need to communicate with each other (they only
>>> communicate with 1)), then
>>>
>>> you can simply MPI_Comm_accept(MPI_COMM_WORLD) in 1)
>>>
>>> if 2 and 3) need to communicate with each other, it would be much easier
>>> to MPI_Comm_spawn or MPI_Comm_spawn_multiple only once in 1),
>>>
>>> so there is only one inter communicator with all the tasks.
>>>
>> My aim is that all the tasks need to communicate with each other. I am
>> implementing a distributed application - game with more players
>> communicating with each other via MPI. It should work as follows - First
>> player creates a game and waits for other players to connect to this game.
>> On different computers (in the same network) the other players can join
>> this game. When they are connected, they should be able to play this game
>> together.
>> I hope, it is clear what my idea is. If it is not, just ask me, please.
>>
>>>
>>> The current test program is growing incrementally the intercomm, which
>>> does require extra steps for synchronization.
>>>
>>>
>>> Cheers,

Re: [OMPI users] Help - Client / server - app hangs in connect/accept by the second or next client that wants to connect to server

2016-07-19 Thread M. D.
Yes, I understand it, but I think this is exactly the situation you are
talking about. In my opinion, the test is doing exactly what you said -
when a new player is willing to join, the other players must invoke
MPI_Comm_accept().
All *other* players must invoke MPI_Comm_accept(). Only the last client (in
this case the last player that wants to join) does not
invoke MPI_Comm_accept(), because this client invokes only
MPI_Comm_connect(). It is connecting to the communicator in which all other
players are already involved, and therefore this last client doesn't have to
invoke MPI_Comm_accept().

Am I still missing something in this reasoning?

Matus
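
For reference, the incremental-join pattern under discussion looks roughly
like this (a sketch, not the exact test code; names are illustrative). Every
task already in the group calls the accept helper collectively, while the
newcomer alone calls the join helper:

#include <mpi.h>

// Called collectively by every task already in 'intracomm' when one more
// player is expected; returns the grown intracommunicator.
MPI_Comm accept_one_more(const char *port_name, MPI_Comm intracomm)
{
  MPI_Comm intercomm, merged;
  MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, intracomm, &intercomm);
  MPI_Intercomm_merge(intercomm, 0, &merged);   // existing members go "low"
  MPI_Comm_disconnect(&intercomm);
  return merged;
}

// Called once by the joining player only.
MPI_Comm join_game(const char *port_name)
{
  MPI_Comm intercomm, merged;
  MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
  MPI_Intercomm_merge(intercomm, 1, &merged);   // the newcomer goes "high"
  MPI_Comm_disconnect(&intercomm);
  return merged;
}

After joining, a player keeps calling accept_one_more() on its merged
communicator once per later arrival, which is why every already-connected
client has to stay in the accept loop until the expected number of players
has been reached.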

2016-07-19 10:55 GMT+02:00 Gilles Gouaillardet :

> here is what the client is doing
>
> printf("CLIENT: after merging, new comm: size=%d rank=%d\n", size,
> rank) ;
>
> for (i = rank ; i < num_clients ; i++)
> {
>   /* client performs a collective accept */
>   CHK(MPI_Comm_accept(server_port_name, MPI_INFO_NULL, 0, intracomm, &intercomm)) ;
>
>   printf("CLIENT: connected to server on port\n") ;
>   [...]
>
> }
>
> 2) has rank 1
>
> /* and 3) has rank 2) */
>
> so unless you run 2) with num_clients=2, MPI_Comm_accept() is never
> called, hence my analysis of the crash/hang
>
>
> I understand what you are trying to achieve, keep in mind
> MPI_Comm_accept() is a collective call, so when a new player
>
> is willing to join, other players must invoke MPI_Comm_accept().
>
> and it is up to you to make sure that happens
>
>
> Cheers,
>
>
> Gilles
>
> On 7/19/2016 5:48 PM, M. D. wrote:
>
>
>
> 2016-07-19 10:06 GMT+02:00 Gilles Gouaillardet :
>
>> MPI_Comm_accept must be called by all the tasks of the local communicator.
>>
> Yes, that's how I understand it. In the source code of the test, all the
> tasks call  MPI_Comm_accept - server and also relevant clients.
>
>> so if you
>>
>> 1) mpirun -np 1 ./singleton_client_server 2 1
>>
>> 2) mpirun -np 1 ./singleton_client_server 2 0
>>
>> 3) mpirun -np 1 ./singleton_client_server 2 0
>>
>> then 3) starts after 2) has exited, so on 1), intracomm is made of 1) and
>> an exited task (2)
>>
> This is not true in my opinion -  because of above mentioned fact that
> MPI_Comm_accept is called by all the tasks of the local communicator.
>
>> /*
>>
>> strictly speaking, there is a race condition, if 2) has exited, then
>> MPI_Comm_accept will crash when 1) informs 2) that 3) has joined.
>>
>> if 2) has not yet exited, then the test will hang because 2) does not
>> invoke MPI_Comm_accept
>>
>> */
>>
> Task 2) does not exit, because of blocking call of MPI_Comm_accept.
>
>>
>>
>
>> there are different ways of seeing things :
>>
>> 1) this is an incorrect usage of the test, the number of clients should
>> be the same everywhere
>>
>> 2) task 2) should not exit (because it did not call
>> MPI_Comm_disconnect()) and the test should hang when
>>
>> starting task 3) because task 2) does not call MPI_Comm_accept()
>>
>>
>> ad 1) I am sorry, but maybe I do not understand what you think - In my
> previous post I wrote that the number of clients is the same in every
> mpirun instance.
> ad 2) it is the same as above
>
>> i do not know how you want to spawn your tasks.
>>
>> if 2) and 3) do not need to communicate with each other (they only
>> communicate with 1)), then
>>
>> you can simply MPI_Comm_accept(MPI_COMM_WORLD) in 1)
>>
>> if 2 and 3) need to communicate with each other, it would be much easier
>> to MPI_Comm_spawn or MPI_Comm_spawn_multiple only once in 1),
>>
>> so there is only one inter communicator with all the tasks.
>>
> My aim is that all the tasks need to communicate with each other. I am
> implementing a distributed application - game with more players
> communicating with each other via MPI. It should work as follows - First
> player creates a game and waits for other players to connect to this game.
> On different computers (in the same network) the other players can join
> this game. When they are connected, they should be able to play this game
> together.
> I hope, it is clear what my idea is. If it is not, just ask me, please.
>
>>
>> The current test program is growing incrementally the intercomm, which
>> does require extra steps for synchronization.
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
> Cheers,
>
> Matus
>
>> On 7/19/2016 4:37 PM, M. D. wrote:
>>
>> Hi,
>> thank you for your interest in this topic.
>>
>> So, I normally run the test as follows:
>> Firstly, I run "server" (second parameter is 1):
>> *mpirun -np 1 ./singleton_client_server number_of_clients 1*
>>
>> Secondly, I run corresponding number of "clients" via following command:
>> *mpirun -np 1 ./singleton_client_server number_of_clients 0*
>>
>> So, for example with 3 clients I do:
>> mpirun -np 1 ./singleton_client_server 3 1
>> mpirun -np 1 ./singleton_client_server 3 0
>> mpirun -np 1 ./singleton_client_server 3 0
>> mpirun -np 1 ./singleton_client_server 3 0
>>
>> It means you are right - there 

Re: [OMPI users] Help - Client / server - app hangs in connect/accept by the second or next client that wants to connect to server

2016-07-19 Thread Gilles Gouaillardet

here is what the client is doing

printf("CLIENT: after merging, new comm: size=%d rank=%d\n", size, 
rank) ;


for (i = rank ; i < num_clients ; i++)
{
  /* client performs a collective accept */
  CHK(MPI_Comm_accept(server_port_name, MPI_INFO_NULL, 0, 
intracomm, )) ;


  printf("CLIENT: connected to server on port\n") ;
  [...]

}

2) has rank 1

/* and 3) has rank 2) */

so unless you run 2) with num_clients=2, MPI_Comm_accept() is never 
called, hence my analysis of the crash/hang



I understand what you are trying to achieve, keep in mind 
MPI_Comm_accept() is a collective call, so when a new player


is willing to join, other players must invoke MPI_Comm_accept().

and it is up to you to make sure that happens


Cheers,


Gilles


On 7/19/2016 5:48 PM, M. D. wrote:



2016-07-19 10:06 GMT+02:00 Gilles Gouaillardet >:


MPI_Comm_accept must be called by all the tasks of the local
communicator.

Yes, that's how I understand it. In the source code of the test, all 
the tasks call  MPI_Comm_accept - server and also relevant clients.


so if you

1) mpirun -np 1 ./singleton_client_server 2 1

2) mpirun -np 1 ./singleton_client_server 2 0

3) mpirun -np 1 ./singleton_client_server 2 0

then 3) starts after 2) has exited, so on 1), intracomm is made of
1) and an exited task (2)

This is not true in my opinion -  because of above mentioned fact that 
MPI_Comm_accept is called by all the tasks of the local communicator.


/*

strictly speaking, there is a race condition, if 2) has exited,
then MPI_Comm_accept will crash when 1) informs 2) that 3) has joined.

if 2) has not yet exited, then the test will hang because 2) does
not invoke MPI_Comm_accept

*/

Task 2) does not exit, because of blocking call of MPI_Comm_accept.


there are different ways of seeing things :

1) this is an incorrect usage of the test, the number of clients
should be the same everywhere

2) task 2) should not exit (because it did not call
MPI_Comm_disconnect()) and the test should hang when

starting task 3) because task 2) does not call MPI_Comm_accept()


ad 1) I am sorry, but maybe I do not understand what you think - In my 
previous post I wrote that the number of clients is the same in every 
mpirun instance.

ad 2) it is the same as above

i do not know how you want to spawn your tasks.

if 2) and 3) do not need to communicate with each other (they only
communicate with 1)), then

you can simply MPI_Comm_accept(MPI_COMM_WORLD) in 1)

if 2 and 3) need to communicate with each other, it would be much
easier to MPI_Comm_spawn or MPI_Comm_spawn_multiple only once in 1),

so there is only one inter communicator with all the tasks.

My aim is that all the tasks need to communicate with each other. I am 
implementing a distributed application - game with more players 
communicating with each other via MPI. It should work as follows - 
First player creates a game and waits for other players to connect to 
this game. On different computers (in the same network) the other 
players can join this game. When they are connected, they should be 
able to play this game together.

I hope, it is clear what my idea is. If it is not, just ask me, please.


The current test program is growing incrementally the intercomm,
which does require extra steps for synchronization.


Cheers,


Gilles

Cheers,

Matus

On 7/19/2016 4:37 PM, M. D. wrote:

Hi,
thank you for your interest in this topic.

So, I normally run the test as follows:
Firstly, I run "server" (second parameter is 1):
*mpirun -np 1 ./singleton_client_server number_of_clients 1*
Secondly, I run corresponding number of "clients" via following
command:
*mpirun -np 1 ./singleton_client_server number_of_clients 0*
So, for example with 3 clients I do:
mpirun -np 1 ./singleton_client_server 3 1
mpirun -np 1 ./singleton_client_server 3 0
mpirun -np 1 ./singleton_client_server 3 0
mpirun -np 1 ./singleton_client_server 3 0

It means you are right - there should be the same number of
clients in each mpirun instance.

The test does not involve MPI_Comm_disconnect(), but the problem
in the test shows up earlier, because some of the clients
(in most cases actually the last client) sometimes cannot
connect to the server, and therefore all the clients and the server
hang (waiting for the connection with the last client(s)).

So, the behaviour of the accept/connect method is a bit confusing to
me.
If I understand you, according to your post the problem is not
in the timeout value, is it?

Cheers,

Matus

2016-07-19 6:28 GMT+02:00 Gilles Gouaillardet >:

How do you run the test ?

you should 

Re: [OMPI users] Help - Client / server - app hangs in connect/accept by the second or next client that wants to connect to server

2016-07-19 Thread M. D.
2016-07-19 10:06 GMT+02:00 Gilles Gouaillardet :

> MPI_Comm_accept must be called by all the tasks of the local communicator.
>
Yes, that's how I understand it. In the source code of the test, all the
tasks call  MPI_Comm_accept - server and also relevant clients.

> so if you
>
> 1) mpirun -np 1 ./singleton_client_server 2 1
>
> 2) mpirun -np 1 ./singleton_client_server 2 0
>
> 3) mpirun -np 1 ./singleton_client_server 2 0
>
> then 3) starts after 2) has exited, so on 1), intracomm is made of 1) and
> an exited task (2)
>
This is not true in my opinion -  because of above mentioned fact that
MPI_Comm_accept is called by all the tasks of the local communicator.

> /*
>
> strictly speaking, there is a race condition, if 2) has exited, then
> MPI_Comm_accept will crash when 1) informs 2) that 3) has joined.
>
> if 2) has not yet exited, then the test will hang because 2) does not
> invoke MPI_Comm_accept
>
> */
>
Task 2) does not exit, because of blocking call of MPI_Comm_accept.

>
>

> there are different ways of seeing things :
>
> 1) this is an incorrect usage of the test, the number of clients should be
> the same everywhere
>
> 2) task 2) should not exit (because it did not call MPI_Comm_disconnect())
> and the test should hang when
>
> starting task 3) because task 2) does not call MPI_Comm_accept()
>
>
> ad 1) I am sorry, but maybe I do not understand what you think - In my
previous post I wrote that the number of clients is the same in every
mpirun instance.
ad 2) it is the same as above

> i do not know how you want to spawn your tasks.
>
> if 2) and 3) do not need to communicate with each other (they only
> communicate with 1)), then
>
> you can simply MPI_Comm_accept(MPI_COMM_WORLD) in 1)
>
> if 2 and 3) need to communicate with each other, it would be much easier
> to MPI_Comm_spawn or MPI_Comm_spawn_multiple only once in 1),
>
> so there is only one inter communicator with all the tasks.
>
My aim is that all the tasks need to communicate with each other. I am
implementing a distributed application - game with more players
communicating with each other via MPI. It should work as follows - First
player creates a game and waits for other players to connect to this game.
On different computers (in the same network) the other players can join
this game. When they are connected, they should be able to play this game
together.
I hope, it is clear what my idea is. If it is not, just ask me, please.

>
> The current test program is growing incrementally the intercomm, which
> does require extra steps for synchronization.
>
>
> Cheers,
>
>
> Gilles
>
Cheers,

Matus

> On 7/19/2016 4:37 PM, M. D. wrote:
>
> Hi,
> thank you for your interest in this topic.
>
> So, I normally run the test as follows:
> Firstly, I run "server" (second parameter is 1):
> *mpirun -np 1 ./singleton_client_server number_of_clients 1*
>
> Secondly, I run corresponding number of "clients" via following command:
> *mpirun -np 1 ./singleton_client_server number_of_clients 0*
>
> So, for example with 3 clients I do:
> mpirun -np 1 ./singleton_client_server 3 1
> mpirun -np 1 ./singleton_client_server 3 0
> mpirun -np 1 ./singleton_client_server 3 0
> mpirun -np 1 ./singleton_client_server 3 0
>
> It means you are right - there should be the same number of clients in
> each mpirun instance.
>
> The test does not involve MPI_Comm_disconnect(), but the problem in the
> test shows up earlier, because some of the clients (in most cases
> actually the last client) sometimes cannot connect to the server, and
> therefore all the clients and the server hang (waiting for the connection
> with the last client(s)).
>
> So, the behaviour of the accept/connect method is a bit confusing to me.
> If I understand you, according to your post the problem is not in the
> timeout value, is it?
>
> Cheers,
>
> Matus
>
> 2016-07-19 6:28 GMT+02:00 Gilles Gouaillardet :
>
>> How do you run the test ?
>>
>> you should have the same number of clients in each mpirun instance, the
>> following simple shell starts the test as i think it is supposed to
>>
>> note the test itself is arguable since MPI_Comm_disconnect() is never
>> invoked
>>
>> (and you will observe some related dpm_base_disconnect_init errors)
>>
>>
>> #!/bin/sh
>>
>> clients=3
>>
>> screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 1
>> 2>&1 | tee /tmp/server.$clients"
>> for i in $(seq $clients); do
>>
>> sleep 1
>>
>> screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 0
>> 2>&1 | tee /tmp/client.$clients.$i"
>> done
>>
>>
>> Ralph,
>>
>>
>> this test fails with master.
>>
>> when the "server" (second parameter is 1), MPI_Comm_accept() fails with a
>> timeout.
>>
>> in ompi/dpm/dpm.c, there is a hard-coded 60-second timeout
>>
>> OPAL_PMIX_EXCHANGE(rc, , , 60);
>>
>> but this is not the timeout that is triggered ...
>>
>> the eviction_cbfunc timeout function is invoked, and it has been set when
>> 

Re: [OMPI users] Help - Client / server - app hangs in connect/accept by the second or next client that wants to connect to server

2016-07-19 Thread Gilles Gouaillardet

MPI_Comm_accept must be called by all the tasks of the local communicator.

so if you

1) mpirun -np 1 ./singleton_client_server 2 1

2) mpirun -np 1 ./singleton_client_server 2 0

3) mpirun -np 1 ./singleton_client_server 2 0

then 3) starts after 2) has exited, so on 1), intracomm is made of 1) 
and an exited task (2)


/*

strictly speaking, there is a race condition, if 2) has exited, then 
MPI_Comm_accept will crash when 1) informs 2) that 3) has joined.


if 2) has not yet exited, then the test will hang because 2) does not 
invoke MPI_Comm_accept


*/


there are different ways of seeing things :

1) this is an incorrect usage of the test, the number of clients should 
be the same everywhere


2) task 2) should not exit (because it did not call 
MPI_Comm_disconnect()) and the test should hang when


starting task 3) because task 2) does not call MPI_Comm_accept()


i do not know how you want to spawn your tasks.

if 2) and 3) do not need to communicate with each other (they only 
communicate with 1)), then


you can simply MPI_Comm_accept(MPI_COMM_WORLD) in 1)

if 2 and 3) need to communicate with each other, it would be much easier 
to MPI_Comm_spawn or MPI_Comm_spawn_multiple only once in 1),


so there is only one inter communicator with all the tasks.


The current test program is growing incrementally the intercomm, which 
does require extra steps for synchronization.
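
To make that concrete, here is a minimal sketch of the pattern (this is
not the original test program: the way the port name reaches the clients
and the argument handling are simplified for illustration, and error
checking and communicator cleanup are omitted):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    /* NCLIENTS plays the role of num_clients in the test program and
     * must be the same value for the server and for every client */
    const int NCLIENTS = 3;
    int is_server = (argc > 1 && argv[1][0] == '1');
    char port[MPI_MAX_PORT_NAME] = "";
    MPI_Comm intracomm = MPI_COMM_WORLD, intercomm;
    int rank, i;

    MPI_Init(&argc, &argv);

    if (is_server) {
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("server port: %s\n", port);  /* the clients need this string */
    } else {
        /* in the real test the port name is passed out of band;
         * here we simply take it from the command line */
        strncpy(port, argv[2], MPI_MAX_PORT_NAME - 1);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm);
        /* the joining client goes "high" so the existing tasks keep ranks 0..n-1 */
        MPI_Intercomm_merge(intercomm, 1, &intracomm);
    }

    /* MPI_Comm_accept() is collective over intracomm: the server and every
     * client that has already joined must call it once per remaining client */
    MPI_Comm_rank(intracomm, &rank);
    for (i = rank; i < NCLIENTS; i++) {
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, intracomm, &intercomm);
        MPI_Intercomm_merge(intercomm, 0, &intracomm);
    }

    /* ... the game is played over intracomm, which now contains everybody ... */

    MPI_Finalize();
    return 0;
}

The key point is the loop: every task already in intracomm calls
MPI_Comm_accept() once for each client that has not joined yet, which is
why the number of clients really has to be the same in every mpirun
instance.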



Cheers,


Gilles

On 7/19/2016 4:37 PM, M. D. wrote:

Hi,
thank you for your interest in this topic.

So, I normally run the test as follows:
Firstly, I run "server" (second parameter is 1):
*mpirun -np 1 ./singleton_client_server number_of_clients 1*
Secondly, I run corresponding number of "clients" via following command:
*mpirun -np 1 ./singleton_client_server number_of_clients 0*
So, for example with 3 clients I do:
mpirun -np 1 ./singleton_client_server 3 1
mpirun -np 1 ./singleton_client_server 3 0
mpirun -np 1 ./singleton_client_server 3 0
mpirun -np 1 ./singleton_client_server 3 0

It means you are right - there should be the same number of clients in 
each mpirun instance.


The test does not involve MPI_Comm_disconnect(), but the problem in 
the test shows up earlier, because some of the clients (in most cases 
actually the last client) sometimes cannot connect to the server, and 
therefore all the clients and the server hang (waiting for the 
connection with the last client(s)).


So, the behaviour of the accept/connect method is a bit confusing to me.
If I understand you, according to your post the problem is not in 
the timeout value, is it?


Cheers,

Matus

2016-07-19 6:28 GMT+02:00 Gilles Gouaillardet >:


How do you run the test ?

you should have the same number of clients in each mpirun
instance, the following simple shell starts the test as i think it
is supposed to

note the test itself is arguable since MPI_Comm_disconnect() is
never invoked

(and you will observe some related dpm_base_disconnect_init errors)


#!/bin/sh

clients=3

screen -d -m sh -c "mpirun -np 1 ./singleton_client_server
$clients 1 2>&1 | tee /tmp/server.$clients"
for i in $(seq $clients); do

sleep 1

screen -d -m sh -c "mpirun -np 1 ./singleton_client_server
$clients 0 2>&1 | tee /tmp/client.$clients.$i"
done


Ralph,


this test fails with master.

when the "server" (second parameter is 1), MPI_Comm_accept() fails
with a timeout.

in ompi/dpm/dpm.c, there is a hard-coded 60-second timeout

OPAL_PMIX_EXCHANGE(rc, , , 60);

but this is not the timeout that is triggered ...

the eviction_cbfunc timeout function is invoked, and it has been
set when opal_hotel_init() was invoked in
orte/orted/pmix/pmix_server.c


default timeout is 2 seconds, but in this case (user invokes
MPI_Comm_accept), i guess the timeout should be infinite or 60
seconds (hard coded value described above)

sadly, if i set a higher timeout value (mpirun --mca
orte_pmix_server_max_wait 180 ...), MPI_Comm_accept() does not
return when the client invokes MPI_Comm_connect()


could you please have a look at this ?


Cheers,


Gilles


On 7/15/2016 9:20 PM, M. D. wrote:

Hello,

I have a problem with basic client - server application. I tried
to run C program from this website

https://github.com/hpc/cce-mpi-openmpi-1.7.1/blob/master/orte/test/mpi/singleton_client_server.c
I saw this program mentioned in many discussions in your website,
so I expected that it should work properly, but after more
testing I found out that there is probably an error somewhere in
connect/accept method. I have read many discussions and threads
on your website, but I have not found similar problem that I am
facing. It seems that nobody had similar problem like me. When I
run this app with one server and more clients (3,4,5,6,...)
   

Re: [OMPI users] Help - Client / server - app hangs in connect/accept by the second or next client that wants to connect to server

2016-07-19 Thread M. D.
Hi,
thank you for your interest in this topic.

So, I normally run the test as follows:
Firstly, I run "server" (second parameter is 1):
*mpirun -np 1 ./singleton_client_server number_of_clients 1*

Secondly, I run corresponding number of "clients" via following command:
*mpirun -np 1 ./singleton_client_server number_of_clients 0*

So, for example with 3 clients I do:
mpirun -np 1 ./singleton_client_server 3 1
mpirun -np 1 ./singleton_client_server 3 0
mpirun -np 1 ./singleton_client_server 3 0
mpirun -np 1 ./singleton_client_server 3 0

It means you are right - there should be the same number of clients in each
mpirun instance.

The test does not involve MPI_Comm_disconnect(), but the problem in the
test shows up earlier, because some of the clients (in most cases
actually the last client) sometimes cannot connect to the server, and
therefore all the clients and the server hang (waiting for the connection
with the last client(s)).

So, the behaviour of the accept/connect method is a bit confusing to me.
If I understand you, according to your post the problem is not in the
timeout value, is it?

Cheers,

Matus

2016-07-19 6:28 GMT+02:00 Gilles Gouaillardet :

> How do you run the test ?
>
> you should have the same number of clients in each mpirun instance, the
> following simple shell starts the test as i think it is supposed to
>
> note the test itself is arguable since MPI_Comm_disconnect() is never
> invoked
>
> (and you will observe some related dpm_base_disconnect_init errors)
>
>
> #!/bin/sh
>
> clients=3
>
> screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 1
> 2>&1 | tee /tmp/server.$clients"
> for i in $(seq $clients); do
>
> sleep 1
>
> screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 0
> 2>&1 | tee /tmp/client.$clients.$i"
> done
>
>
> Ralph,
>
>
> this test fails with master.
>
> when the "server" (second parameter is 1), MPI_Comm_accept() fails with a
> timeout.
>
> in ompi/dpm/dpm.c, there is a hard-coded 60-second timeout
>
> OPAL_PMIX_EXCHANGE(rc, , , 60);
>
> but this is not the timeout that is triggered ...
>
> the eviction_cbfunc timeout function is invoked, and it has been set when
> opal_hotel_init() was invoked in orte/orted/pmix/pmix_server.c
>
>
> default timeout is 2 seconds, but in this case (user invokes
> MPI_Comm_accept), i guess the timeout should be infinite or 60 seconds
> (hard coded value described above)
>
> sadly, if i set a higher timeout value (mpirun --mca
> orte_pmix_server_max_wait 180 ...), MPI_Comm_accept() does not return when
> the client invokes MPI_Comm_connect()
>
>
> could you please have a look at this ?
>
>
> Cheers,
>
>
> Gilles
>
> On 7/15/2016 9:20 PM, M. D. wrote:
>
> Hello,
>
> I have a problem with basic client - server application. I tried to run C
> program from this website
> 
> https://github.com/hpc/cce-mpi-openmpi-1.7.1/blob/master/orte/test/mpi/singleton_client_server.c
> I saw this program mentioned in many discussions in your website, so I
> expected that it should work properly, but after more testing I found out
> that there is probably an error somewhere in connect/accept method. I have
> read many discussions and threads on your website, but I have not found
> similar problem that I am facing. It seems that nobody had similar problem
> like me. When I run this app with one server and more clients (3,4,5,6,...)
> sometimes the app hangs. It hangs when second or next client wants to
> connect to the server (it depends, sometimes third client hangs, sometimes
> fourth, sometimes second, and so on).
> So it means that app starts to hang where server waits for accept and
> client waits for connect. And it is not possible to continue, because this
> client cannot connect to the server. It is strange, because I observed this
> behaviour only in some cases... Sometimes it works without any problems,
> sometimes it does not work. The behaviour is unpredictable and not stable.
>
> I have installed openmpi 1.10.2 on my Fedora 19. I have the same problem
> with Java alternative of this application. It hangs also sometimes... I
> need this app in Java, but firstly it must work properly in C
> implementation. Because of this strange behaviour I assume that there can
> be an error maybe inside of openmpi implementation of connect/accept
> methods. I tried it also with another version of openmpi - 1.8.1. However,
> the problem did not disappear.
>
> Could you help me, what can cause the problem? Maybe I did not get
> something about openmpi (or connect/server) and the problem is with me... I
> will appreciate any help, support, or interest in this topic.
>
> Best regards,
> Matus Dobrotka
>
>
> ___
> users mailing listus...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 

Re: [OMPI users] Help - Client / server - app hangs in connect/accept by the second or next client that wants to connect to server

2016-07-19 Thread Gilles Gouaillardet

How do you run the test ?

you should have the same number of clients in each mpirun instance, the 
following simple shell starts the test as i think it is supposed to


note the test itself is arguable since MPI_Comm_disconnect() is never 
invoked


(and you will observe some related dpm_base_disconnect_init errors)


#!/bin/sh

clients=3

screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 
1 2>&1 | tee /tmp/server.$clients"

for i in $(seq $clients); do

sleep 1

screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 
0 2>&1 | tee /tmp/client.$clients.$i"

done


Ralph,


this test fails with master.

when the "server" (second parameter is 1), MPI_Comm_accept() fails with 
a timeout.


in ompi/dpm/dpm.c, there is a hard-coded 60-second timeout

OPAL_PMIX_EXCHANGE(rc, , , 60);

but this is not the timeout that is triggered ...

the eviction_cbfunc timeout function is invoked, and it has been set 
when opal_hotel_init() was invoked in orte/orted/pmix/pmix_server.c



default timeout is 2 seconds, but in this case (user invokes 
MPI_Comm_accept), i guess the timeout should be infinite or 60 seconds 
(hard coded value described above)


sadly, if i set a higher timeout value (mpirun --mca 
orte_pmix_server_max_wait 180 ...), MPI_Comm_accept() does not return 
when the client invokes MPI_Comm_connect()



could you please have a look at this ?


Cheers,


Gilles


On 7/15/2016 9:20 PM, M. D. wrote:

Hello,

I have a problem with a basic client - server application. I tried to 
run the C program from this website 
https://github.com/hpc/cce-mpi-openmpi-1.7.1/blob/master/orte/test/mpi/singleton_client_server.c
I saw this program mentioned in many discussions on your website, so I 
expected that it should work properly, but after more testing I found 
out that there is probably an error somewhere in the connect/accept 
method. I have read many discussions and threads on your website, but 
I have not found a problem similar to the one I am facing. It seems that 
nobody has had a similar problem. When I run this app with one 
server and more clients (3,4,5,6,...), sometimes the app hangs. It 
hangs when the second or a later client wants to connect to the server (it 
varies: sometimes the third client hangs, sometimes the fourth, sometimes 
the second, and so on).
So the app starts to hang where the server waits in accept and the 
client waits in connect. And it is not possible to continue, because 
this client cannot connect to the server. It is strange, because I 
observed this behaviour only in some cases... Sometimes it works 
without any problems, sometimes it does not. The behaviour is 
unpredictable and not stable.


I have installed openmpi 1.10.2 on my Fedora 19. I have the same 
problem with the Java version of this application. It also hangs 
sometimes... I need this app in Java, but first it must work 
properly in the C implementation. Because of this strange behaviour I 
assume that there may be an error inside the openmpi 
implementation of the connect/accept methods. I also tried it with another 
version of openmpi - 1.8.1. However, the problem did not disappear.


Could you help me figure out what can cause the problem? Maybe I did not get 
something about openmpi (or connect/server) and the problem is with 
me... I will appreciate any help, support, or interest in this 
topic.


Best regards,
Matus Dobrotka


___
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/07/29673.php




Re: [OMPI users] Help on Windows

2016-02-23 Thread Walt Brainerd
Thank you, Gilles! It's amazing to get such help.

It seems to work when I unplugged the ethernet
and have the wireless on, but I will check it out
further (including the firewall situation) to pin it down.

 time mpirun -np 4 ./a
 Hello from   1 out of   4 images.
 Hello from   2 out of   4 images.
 Hello from   3 out of   4 images.
 Hello from   4 out of   4 images.

real    0m0.774s   ---
user    0m0.341s
sys     0m0.933s

On Tue, Feb 23, 2016 at 4:26 PM, Gilles Gouaillardet 
wrote:

> Walt,
>
> generally speaking, that kind of things happen when you are using a
> wireless network and/or a firewall.
>
> so i recommend you first try to disconnect all your networks and see how
> things get improved
>
> Cheers,
>
> Gilles
>
>
> On 2/24/2016 5:08 AM, Walt Brainerd wrote:
>
> I am running up-to-date cygwin on W10 on a 4x i5 processor,
> including gcc 5.3.
>
> I built libcaf_mpi.a provided by OpenCoarrays.
>
> $ cat hello.f90
> program hello
>
>implicit none
>
>print *, "Hello from", this_image(), &
> "out of", num_images(), "images."
>
> end program hello
>
> I compiled the hello.f90 with
>
> $ mpifort -fcoarray=lib hello.f90 libcaf_mpi.a
>
> and ran it with
>
> $ time mpirun -np 4 ./a
>  Hello from   1 out of   4 images.
>  Hello from   2 out of   4 images.
>  Hello from   3 out of   4 images.
>  Hello from   4 out of   4 images.
>
> real    0m42.733s   ! <
> user    0m0.201s
> sys     0m0.934s
>
> So I am getting this long startup delay. The same thing
> happens with other coarray programs.
>
> Any ideas? BTW, I know almost nothing about MPI :-(.
>
> Thanks.
>
> --
> Walt Brainerd
>
>
> ___
> users mailing listus...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/02/28569.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/02/28570.php
>



-- 
Walt Brainerd


Re: [OMPI users] Help on Windows

2016-02-23 Thread Gilles Gouaillardet

Walt,

generally speaking, that kind of things happen when you are using a 
wireless network and/or a firewall.


so i recommend you first try to disconnect all your networks and see how 
things get improved


Cheers,

Gilles

On 2/24/2016 5:08 AM, Walt Brainerd wrote:

I am running up-to-date cygwin on W10 on a 4x i5 processor,
including gcc 5.3.

I built libcaf_mpi.a provided by OpenCoarrays.

$ cat hello.f90
program hello

   implicit none

   print *, "Hello from", this_image(), &
"out of", num_images(), "images."

end program hello

I compiled the hello.f90 with

$ mpifort -fcoarray=lib hello.f90 libcaf_mpi.a

and ran it with

$ time mpirun -np 4 ./a
 Hello from   1 out of   4 images.
 Hello from   2 out of   4 images.
 Hello from   3 out of   4 images.
 Hello from   4 out of   4 images.

real    0m42.733s   ! <
user    0m0.201s
sys     0m0.934s

So I am getting this long startup delay. The same thing
happens with other coarray programs.

Any ideas? BTW, I know almost nothing about MPI :-(.

Thanks.

--
Walt Brainerd


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/02/28569.php




Re: [OMPI users] Help with OpenMPI and Univa Grid Engine

2016-02-09 Thread Rahul Pisharody
Hello Ralph, Dave,

Thank you for your suggestions. Let me check on the nfs mounts.

The problem is I am not the grid administrator. I'm working with the grid
administrator to get it resolved. If I had my way, I would probably be
using Sun Grid.

Thank you Dave for pointing out something that I had missed. Let me ask the
admin to check with the Univa guys as well.

Thanks,
Rahul

On Tue, Feb 9, 2016 at 4:43 AM, Dave Love  wrote:

> Rahul Pisharody  writes:
>
> > Hello all,
> >
> > I'm trying to get a simple program (print the hostname of the executing
> > machine) compiled with openmpi run across multiple machines on Univa Grid
> > Engine.
> >
> > This particular configuration has many of the ports blocked. My run
> command
> > has the mca options necessary to limit the ports to the known open ports.
> >
> > However, when I launch the program with mpirun, I get the following error
> > messages:
> >
> > +
> >> error: executing task of job 23 failed: execution daemon on host
> >> "" didn't accept task
>
> So you have a grid engine problem and you're paying Univa a load of
> money (with one of the selling points being MPI support, if I recall
> correctly)...
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/02/28473.php
>


Re: [OMPI users] Help with OpenMPI and Univa Grid Engine

2016-02-09 Thread Dave Love
Rahul Pisharody  writes:

> Hello all,
>
> I'm trying to get a simple program (print the hostname of the executing
> machine) compiled with openmpi run across multiple machines on Univa Grid
> Engine.
>
> This particular configuration has many of the ports blocked. My run command
> has the mca options necessary to limit the ports to the known open ports.
>
> However, when I launch the program with mpirun, I get the following error
> messages:
>
> +
>> error: executing task of job 23 failed: execution daemon on host
>> "" didn't accept task

So you have a grid engine problem and you're paying Univa a load of
money (with one of the selling points being MPI support, if I recall
correctly)...


Re: [OMPI users] Help with OpenMPI and Univa Grid Engine

2016-02-08 Thread Ralph Castain
Is your OMPI installed on an NFS partition? If so, is it in the same mount 
point on all nodes?

Most likely problem is that the required libraries were not found on the remote 
node

> On Feb 8, 2016, at 10:45 AM, Rahul Pisharody  wrote:
> 
> Hello all, 
> 
> I'm trying to get a simple program (print the hostname of the executing 
> machine) compiled with openmpi run across multiple machines on Univa Grid 
> Engine. 
> 
> This particular configuration has many of the ports blocked. My run command 
> has the mca options necessary to limit the ports to the known open ports.
> 
> However, when I launch the program with mpirun, I get the following error 
> messages:
> 
> +
> error: executing task of job 23 failed: execution daemon on host "" 
> didn't accept task
> --
> A daemon (pid 10126) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>  
> There may be more information reported by the environment (see above).
>  
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> error: executing task of job 23 failed: execution daemon on host "machine" 
> didn't accept task
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> 
> 
> I've set the LD_LIBRARY_PATH and I've verified that path points to the 
> necessary shared libraries.
> 
> Any idea/suggestion as to what is happening here will be greatly appreciated.
> 
> Thanks,
> Rahul
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/02/28467.php



Re: [OMPI users] Help with Binding in 1.8.8: Use only second socket

2015-12-21 Thread Saliya Ekanayake
I tried the following with OpenMPI 1.8.1 and 1.10.1. Both worked. In my
case a node has 2 sockets like yours, but each socket has 12 cores, and
lstopo showed that the core numbers for the second socket are 12 to 23.

* mpirun --report-bindings --bind-to core --cpu-set 12,13,14,15,16,17,18,19
-np 8 java Hello*

[j-049:182867] MCW rank 0 bound to socket 1[core 12[hwt 0-1]]:
[../../../../../../../../../../../..][BB/../../../../../../../../../../..]
[j-049:182867] MCW rank 1 bound to socket 1[core 13[hwt 0-1]]:
[../../../../../../../../../../../..][../BB/../../../../../../../../../..]
[j-049:182867] MCW rank 2 bound to socket 1[core 14[hwt 0-1]]:
[../../../../../../../../../../../..][../../BB/../../../../../../../../..]
[j-049:182867] MCW rank 3 bound to socket 1[core 15[hwt 0-1]]:
[../../../../../../../../../../../..][../../../BB/../../../../../../../..]
[j-049:182867] MCW rank 4 bound to socket 1[core 16[hwt 0-1]]:
[../../../../../../../../../../../..][../../../../BB/../../../../../../..]
[j-049:182867] MCW rank 5 bound to socket 1[core 17[hwt 0-1]]:
[../../../../../../../../../../../..][../../../../../BB/../../../../../..]
[j-049:182867] MCW rank 6 bound to socket 1[core 18[hwt 0-1]]:
[../../../../../../../../../../../..][../../../../../../BB/../../../../..]
[j-049:182867] MCW rank 7 bound to socket 1[core 19[hwt 0-1]]:
[../../../../../../../../../../../..][../../../../../../../BB/../../../..]



On Mon, Dec 21, 2015 at 11:40 AM, Matt Thompson  wrote:

> Ralph,
>
> Huh. That isn't in the Open MPI 1.8.8 mpirun man page. It is in Open MPI
> 1.10, so I'm guessing someone noticed it wasn't there. Explains why I
> didn't try it out. I'm assuming this option is respected on all nodes?
>
> Note: a SmarterManThanI™ here at Goddard thought up this:
>
> #!/bin/bash
> rank=0
> for node in $(srun uname -n | sort); do
> echo "rank $rank=$node slots=1:*"
> let rank+=1
> done
>
> It does seem to work in synthetic tests so I'm trying it now in my real
> job. I had to hack a few run scripts so I'll probably spend the next hour
> debugging something dumb I did.
>
> What I'm wondering about all this is: can this be done with --slot-list?
> Or, perhaps, does --slot-list even work?
>
> I have tried about 20 different variations of it, e.g., --slot-list 1:*,
> --slot-list '1:*', --slot-list 1:0,1,2,3,4,5,6,7, --slot-list
> 1:8,9,10,11,12,13,14,15, --slot-list 8-15, , and every time I seem to
> trigger an error via help-rmaps_rank_file.txt. I tried to read
> through opal_hwloc_base_slot_list_parse in the source, but my C isn't great
> (see my gmail address name) so that didn't help. Might not even be the
> right function, but I was just acking the code.
>
> Thanks,
> Matt
>
>
> On Mon, Dec 21, 2015 at 10:51 AM, Ralph Castain  wrote:
>
>> Try adding —cpu-set a,b,c,…  where the a,b,c… are the core id’s of your
>> second socket. I’m working on a cleaner option as this has come up before.
>>
>>
>> On Dec 21, 2015, at 5:29 AM, Matt Thompson  wrote:
>>
>> Dear Open MPI Gurus,
>>
>> I'm currently trying to do something with Open MPI 1.8.8 that I'm pretty
>> sure is possible, but I'm just not smart enough to figure out. Namely, I'm
>> seeing some odd GPU timings and I think it's because I was dumb and assumed
>> the GPU was on the PCI bus next to Socket #0 as some older GPU nodes I ran
>> on were like that.
>>
>> But, a trip through lspci and lstopo has shown me that the GPU is
>> actually on Socket #1. These are dual socket Sandy Bridge nodes and I'd
>> like to do some tests where I run a 8 processes per node and those
>> processes all land on Socket #1.
>>
>> So, what I'm trying to figure out is how to have Open MPI bind processes
>> like that. My first thought as always is to run a helloworld job with
>> -report-bindings on. I can manage to do this:
>>
>> (1061) $ mpirun -np 8 -report-bindings -map-by core ./helloWorld.exe
>> [borg01z205:16306] MCW rank 4 bound to socket 0[core 4[hwt 0]]:
>> [././././B/././.][./././././././.]
>> [borg01z205:16306] MCW rank 5 bound to socket 0[core 5[hwt 0]]:
>> [./././././B/./.][./././././././.]
>> [borg01z205:16306] MCW rank 6 bound to socket 0[core 6[hwt 0]]:
>> [././././././B/.][./././././././.]
>> [borg01z205:16306] MCW rank 7 bound to socket 0[core 7[hwt 0]]:
>> [./././././././B][./././././././.]
>> [borg01z205:16306] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
>> [B/././././././.][./././././././.]
>> [borg01z205:16306] MCW rank 1 bound to socket 0[core 1[hwt 0]]:
>> [./B/./././././.][./././././././.]
>> [borg01z205:16306] MCW rank 2 bound to socket 0[core 2[hwt 0]]:
>> [././B/././././.][./././././././.]
>> [borg01z205:16306] MCW rank 3 bound to socket 0[core 3[hwt 0]]:
>> [./././B/./././.][./././././././.]
>> Process7 of8 is on borg01z205
>> Process5 of8 is on borg01z205
>> Process2 of8 is on borg01z205
>> Process3 of8 is on borg01z205
>> Process4 of8 is on borg01z205
>> Process6 

Re: [OMPI users] Help with Binding in 1.8.8: Use only second socket

2015-12-21 Thread Matt Thompson
Ralph,

Huh. That isn't in the Open MPI 1.8.8 mpirun man page. It is in Open MPI
1.10, so I'm guessing someone noticed it wasn't there. Explains why I
didn't try it out. I'm assuming this option is respected on all nodes?

Note: a SmarterManThanI™ here at Goddard thought up this:

#!/bin/bash
rank=0
for node in $(srun uname -n | sort); do
echo "rank $rank=$node slots=1:*"
let rank+=1
done
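
(The echoed lines are in the usual rankfile syntax, "rank N=host slots=...",
so presumably the output gets redirected to a file that is then passed to
mpirun with -rf/--rankfile.)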

It does seem to work in synthetic tests so I'm trying it now in my real
job. I had to hack a few run scripts so I'll probably spend the next hour
debugging something dumb I did.

What I'm wondering about all this is: can this be done with --slot-list?
Or, perhaps, does --slot-list even work?

I have tried about 20 different variations of it, e.g., --slot-list 1:*,
--slot-list '1:*', --slot-list 1:0,1,2,3,4,5,6,7, --slot-list
1:8,9,10,11,12,13,14,15, --slot-list 8-15, , and every time I seem to
trigger an error via help-rmaps_rank_file.txt. I tried to read
through opal_hwloc_base_slot_list_parse in the source, but my C isn't great
(see my gmail address name) so that didn't help. Might not even be the
right function, but I was just acking the code.

Thanks,
Matt


On Mon, Dec 21, 2015 at 10:51 AM, Ralph Castain  wrote:

> Try adding —cpu-set a,b,c,…  where the a,b,c… are the core id’s of your
> second socket. I’m working on a cleaner option as this has come up before.
>
>
> On Dec 21, 2015, at 5:29 AM, Matt Thompson  wrote:
>
> Dear Open MPI Gurus,
>
> I'm currently trying to do something with Open MPI 1.8.8 that I'm pretty
> sure is possible, but I'm just not smart enough to figure out. Namely, I'm
> seeing some odd GPU timings and I think it's because I was dumb and assumed
> the GPU was on the PCI bus next to Socket #0 as some older GPU nodes I ran
> on were like that.
>
> But, a trip through lspci and lstopo has shown me that the GPU is actually
> on Socket #1. These are dual socket Sandy Bridge nodes and I'd like to do
> some tests where I run a 8 processes per node and those processes all land
> on Socket #1.
>
> So, what I'm trying to figure out is how to have Open MPI bind processes
> like that. My first thought as always is to run a helloworld job with
> -report-bindings on. I can manage to do this:
>
> (1061) $ mpirun -np 8 -report-bindings -map-by core ./helloWorld.exe
> [borg01z205:16306] MCW rank 4 bound to socket 0[core 4[hwt 0]]:
> [././././B/././.][./././././././.]
> [borg01z205:16306] MCW rank 5 bound to socket 0[core 5[hwt 0]]:
> [./././././B/./.][./././././././.]
> [borg01z205:16306] MCW rank 6 bound to socket 0[core 6[hwt 0]]:
> [././././././B/.][./././././././.]
> [borg01z205:16306] MCW rank 7 bound to socket 0[core 7[hwt 0]]:
> [./././././././B][./././././././.]
> [borg01z205:16306] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
> [B/././././././.][./././././././.]
> [borg01z205:16306] MCW rank 1 bound to socket 0[core 1[hwt 0]]:
> [./B/./././././.][./././././././.]
> [borg01z205:16306] MCW rank 2 bound to socket 0[core 2[hwt 0]]:
> [././B/././././.][./././././././.]
> [borg01z205:16306] MCW rank 3 bound to socket 0[core 3[hwt 0]]:
> [./././B/./././.][./././././././.]
> Process7 of8 is on borg01z205
> Process5 of8 is on borg01z205
> Process2 of8 is on borg01z205
> Process3 of8 is on borg01z205
> Process4 of8 is on borg01z205
> Process6 of8 is on borg01z205
> Process0 of8 is on borg01z205
> Process1 of8 is on borg01z205
>
> Great...but wrong socket! Is there a way to tell it to use Socket 1
> instead?
>
> Note I'll be running under SLURM, so I will only have 8 processes per
> node, so it shouldn't need to use Socket 0.
> --
> Matt Thompson
>
> Man Among Men
> Fulcrum of History
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/12/28190.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/12/28195.php
>



-- 
Matt Thompson

Man Among Men
Fulcrum of History


Re: [OMPI users] Help with Binding in 1.8.8: Use only second socket

2015-12-21 Thread Ralph Castain
Try adding —cpu-set a,b,c,…  where the a,b,c… are the core id’s of your second 
socket. I’m working on a cleaner option as this has come up before.
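
For your nodes that would be something like (untested, and assuming the
second socket's cores are numbered 8-15, as in your --slot-list attempts):

mpirun -np 8 --cpu-set 8,9,10,11,12,13,14,15 --bind-to core -report-bindings ./helloWorld.exe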


> On Dec 21, 2015, at 5:29 AM, Matt Thompson  > wrote:
> 
> Dear Open MPI Gurus,
> 
> I'm currently trying to do something with Open MPI 1.8.8 that I'm pretty sure 
> is possible, but I'm just not smart enough to figure out. Namely, I'm seeing 
> some odd GPU timings and I think it's because I was dumb and assumed the GPU 
> was on the PCI bus next to Socket #0 as some older GPU nodes I ran on were 
> like that. 
> 
> But, a trip through lspci and lstopo has shown me that the GPU is actually on 
> Socket #1. These are dual socket Sandy Bridge nodes and I'd like to do some 
> tests where I run a 8 processes per node and those processes all land on 
> Socket #1.
> 
> So, what I'm trying to figure out is how to have Open MPI bind processes like 
> that. My first thought as always is to run a helloworld job with 
> -report-bindings on. I can manage to do this:
> 
> (1061) $ mpirun -np 8 -report-bindings -map-by core ./helloWorld.exe
> [borg01z205:16306] MCW rank 4 bound to socket 0[core 4[hwt 0]]: 
> [././././B/././.][./././././././.]
> [borg01z205:16306] MCW rank 5 bound to socket 0[core 5[hwt 0]]: 
> [./././././B/./.][./././././././.]
> [borg01z205:16306] MCW rank 6 bound to socket 0[core 6[hwt 0]]: 
> [././././././B/.][./././././././.]
> [borg01z205:16306] MCW rank 7 bound to socket 0[core 7[hwt 0]]: 
> [./././././././B][./././././././.]
> [borg01z205:16306] MCW rank 0 bound to socket 0[core 0[hwt 0]]: 
> [B/././././././.][./././././././.]
> [borg01z205:16306] MCW rank 1 bound to socket 0[core 1[hwt 0]]: 
> [./B/./././././.][./././././././.]
> [borg01z205:16306] MCW rank 2 bound to socket 0[core 2[hwt 0]]: 
> [././B/././././.][./././././././.]
> [borg01z205:16306] MCW rank 3 bound to socket 0[core 3[hwt 0]]: 
> [./././B/./././.][./././././././.]
> Process7 of8 is on borg01z205
> Process5 of8 is on borg01z205
> Process2 of8 is on borg01z205
> Process3 of8 is on borg01z205
> Process4 of8 is on borg01z205
> Process6 of8 is on borg01z205
> Process0 of8 is on borg01z205
> Process1 of8 is on borg01z205
> 
> Great...but wrong socket! Is there a way to tell it to use Socket 1 instead? 
> 
> Note I'll be running under SLURM, so I will only have 8 processes per node, 
> so it shouldn't need to use Socket 0.
> -- 
> Matt Thompson
> Man Among Men
> Fulcrum of History
> ___
> users mailing list
> us...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/12/28190.php



Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-30 Thread Jeff Squyres (jsquyres)
On Nov 24, 2015, at 9:31 AM, Dave Love  wrote:
> 
>> btw, we already use the force, thanks to the ob1 pml and the yoda spml
> 
> I think that's assuming familiarity with something which leaves out some
> people...

FWIW, I agree: we use unhelpful names for components in Open MPI.  What Gilles 
is specifically referring to here is that there are several Star Wars-based 
names of plugins in Open MPI.  They mean something to us developers (they 
started off as a funny joke), but they mean little/nothing to end users.

I actually specifically called out this issue in the SC'15 Open MPI BOF:

http://image.slidesharecdn.com/ompi-bof-2015-for-web-151130155610-lva1-app6891/95/open-mpi-sc15-state-of-the-union-bof-28-638.jpg?cb=1448898995

This is definitely an issue that is on the agenda for the face-to-face Open MPI 
developer's meeting in February 
(https://github.com/open-mpi/ompi/wiki/Meeting-2016-02).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-24 Thread Dave Love
Gilles Gouaillardet  writes:

> Currently, ompi creates a file in the temporary directory and then mmaps it.
> an obvious requirement is that the temporary directory must have enough free
> space for that file.
> (this might be an issue on some disk less nodes)
>
>
> a simple alternative could be to try /tmp, and if there is not enough
> space, try /dev/shm
> (unless the tmpdir has been set explicitly)
>
> any thought ?

/tmp is already the default if TMPDIR et al aren't defined, isn't it?

While you may not have any choice to use /dev/shm on a diskless node, it
doesn't seem a good thing to do by default for large maps.  It wasn't
here.

[I've never been sure of the semantics of mmap over tmpfs.]

I think the important thing is clear explanation of any error, and
suggestions for workarounds.  Presumably anyone operating diskless nodes
has made arrangements for this sort of thing.
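
(And, if I remember the parameter name correctly, they can already point the
session directory somewhere else with the orte_tmpdir_base MCA parameter, or
simply by setting TMPDIR.)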

> Gilles
>
> btw, we already use the force, thanks to the ob1 pml and the yoda spml

I think that's assuming familiarity with something which leaves out some
people...


Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-20 Thread Gilles Gouaillardet
Currently, ompi creates a file in the temporary directory and then mmaps it.
an obvious requirement is that the temporary directory must have enough free
space for that file.
(this might be an issue on some disk less nodes)


a simple alternative could be to try /tmp, and if there is not enough
space, try /dev/shm
(unless the tmpdir has been set explicitly)

any thought ?

Gilles

btw, we already use the force, thanks to the ob1 pml and the yoda spml

On Friday, November 20, 2015, Dave Love  wrote:

> Jeff Hammond > writes:
>
> >> Doesn't mpich have the option to use sysv memory?  You may want to try
> that
> >>
> >>
> > MPICH?  Look, I may have earned my way onto Santa's naughty list more
> than
> > a few times, but at least I have the decency not to post MPICH questions
> to
> > the Open-MPI list ;-)
> >
> > If there is a way to tell Open-MPI to use shm_open without filesystem
> > backing (if that is even possible) at configure time, I'd love to do
> that.
>
> I'm not sure I understand what's required, but is this what you're after?
>
>   $ ompi_info --param shmem all -l 9|grep priority
>  MCA shmem: parameter "shmem_mmap_priority" (current
> value: "50", data source: default, level: 3 user/all, type: int)
>  MCA shmem: parameter "shmem_posix_priority" (current
> value: "40", data source: default, level: 3 user/all, type: int)
>  MCA shmem: parameter "shmem_sysv_priority" (current
> value: "30", data source: default, level: 3 user/all, type: int)
>
> >> In the spirit OMPI - may the force be with you.
> >>
> >>
> > All I will say here is that Open-MPI has a Vader BTL :-)
>
> Whatever that might mean.
> ___
> users mailing list
> us...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/11/28084.php
>


Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-20 Thread Dave Love
Jeff Hammond  writes:

>> Doesn't mpich have the option to use sysv memory?  You may want to try that
>>
>>
> MPICH?  Look, I may have earned my way onto Santa's naughty list more than
> a few times, but at least I have the decency not to post MPICH questions to
> the Open-MPI list ;-)
>
> If there is a way to tell Open-MPI to use shm_open without filesystem
> backing (if that is even possible) at configure time, I'd love to do that.

I'm not sure I understand what's required, but is this what you're after?

  $ ompi_info --param shmem all -l 9|grep priority
 MCA shmem: parameter "shmem_mmap_priority" (current value: 
"50", data source: default, level: 3 user/all, type: int)
 MCA shmem: parameter "shmem_posix_priority" (current value: 
"40", data source: default, level: 3 user/all, type: int)
 MCA shmem: parameter "shmem_sysv_priority" (current value: 
"30", data source: default, level: 3 user/all, type: int)

>> In the spirit OMPI - may the force be with you.
>>
>>
> All I will say here is that Open-MPI has a Vader BTL :-)

Whatever that might mean.


Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-20 Thread Dave Love
[There must be someone better to answer this, but since I've seen it:]

Jeff Hammond  writes:

> I have no idea what this is trying to tell me.  Help?
>
> jhammond@nid00024:~/MPI/qoit/collectives> mpirun -n 2 ./driver.x 64
> [nid00024:00482] [[46168,0],0] ORTE_ERROR_LOG: Not found in file
> ../../../../../orte/mca/plm/alps/plm_alps_module.c at line 418

That must be a system error message, presumably indicating why the
process couldn't be launched; it's not in the OMPI source.

> I can run the same job with srun without incident:
>
> jhammond@nid00024:~/MPI/qoit/collectives> srun -n 2 ./driver.x 64
> MPI was initialized.
>
> This is on the NERSC Cori Cray XC40 system.  I build Open-MPI git head from
> source for OFI libfabric.
>
> I have many other issues, which I will report later.  As a spoiler, if I
> cannot use your mpirun, I cannot set any of the MCA options there.  Is
> there a method to set MCA options with environment variables?  I could not
> find this documented anywhere.

mpirun(1) documents the mechanisms under "Setting MCA Parameters",
unless it's changed since 1.8.  [I have wondered why a file in cwd isn't
a possibility, only in $HOME.]
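
(For what it's worth, the environment-variable form is
OMPI_MCA_<param_name>=<value>, e.g.

  export OMPI_MCA_mtl_ofi_provider_include=sockets

which is the same example Howard gives elsewhere in this thread.)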


Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-19 Thread Jeff Hammond
On Thu, Nov 19, 2015 at 4:11 PM, Howard Pritchard 
wrote:

> Hi Jeff H.
>
> Why don't you just try configuring with
>
> ./configure --prefix=my_favorite_install_dir
> --with-libfabric=install_dir_for_libfabric
> make -j 8 install
>
> and see what happens?
>
>
That was the first thing I tried.  However, it seemed to give me a
Verbs-oriented build, and Verbs is the Sith lord to us JedOFIs :-)

>From aforementioned Wiki:

../configure \
 --with-libfabric=$HOME/OFI/install-ofi-gcc-gni-cori \
 --disable-shared \
 --prefix=$HOME/MPI/install-ompi-ofi-gcc-gni-cori

Unfortunately, this (above) leads to an mpicc that indicates support for IB
Verbs, not OFI.
I will try again though just in case.


> Make sure before you configure that you have PrgEnv-gnu or PrgEnv-intel
> module loaded.
>
>
Yeah, I know better than to use the Cray compilers for such things (e.g.
https://github.com/jeffhammond/OpenPA/commit/965ca014ea3148ee5349e16d2cec1024271a7415
)


> Those were the configure/compiler options I used to do testing of ofi mtl
> on cori.
>
> Jeff S. - this thread has gotten intermingled with mpich setup as well,
> hence
> the suggestion for the mpich shm mechanism.
>
>
The first OSS implementation of MPI that I can use on Cray XC using OFI
gets a prize at the December MPI Forum.

Best,

Jeff



> Howard
>
>
>
> 2015-11-19 16:59 GMT-07:00 Jeff Hammond :
>
>>
>>> How did you configure for Cori?  You need to be using the slurm plm
>>> component for that system.  I know this sounds like gibberish.
>>>
>>>
>> ../configure --with-libfabric=$HOME/OFI/install-ofi-gcc-gni-cori \
>>  --enable-mca-static=mtl-ofi \
>>  --enable-mca-no-build=btl-openib,btl-vader,btl-ugni,btl-tcp \
>>  --enable-static --disable-shared --disable-dlopen \
>>  --prefix=$HOME/MPI/install-ompi-ofi-gcc-gni-xpmem-cori \
>>  --with-cray-pmi --with-alps --with-cray-xpmem --with-slurm \
>>  --without-verbs --without-fca --without-mxm --without-ucx \
>>  --without-portals4 --without-psm --without-psm2 \
>>  --without-udreg --without-ugni --without-munge \
>>  --without-sge --without-loadleveler --without-tm --without-lsf \
>>  --without-pvfs2 --without-plfs \
>>  --without-cuda --disable-oshmem \
>>  --disable-mpi-fortran --disable-oshmem-fortran \
>>  LDFLAGS="-L/opt/cray/ugni/default/lib64 -lugni \
>>   -L/opt/cray/alps/default/lib64 -lalps -lalpslli -lalpsutil \   
>>-ldl -lrt"
>>
>>
>> This is copied from
>> https://github.com/jeffhammond/HPCInfo/blob/master/ofi/README.md#open-mpi,
>> which I note in case you want to see what changes I've made at any point in
>> the future.
>>
>>
>>> There should be a with-slurm configure option to pick up this component.
>>>
>>> Indeed there is.
>>
>>
>>> Doesn't mpich have the option to use sysv memory?  You may want to try
>>> that
>>>
>>>
>> MPICH?  Look, I may have earned my way onto Santa's naughty list more
>> than a few times, but at least I have the decency not to post MPICH
>> questions to the Open-MPI list ;-)
>>
>> If there is a way to tell Open-MPI to use shm_open without filesystem
>> backing (if that is even possible) at configure time, I'd love to do that.
>>
>>
>>> Oh for tuning params you can use env variables.  For example lets say
>>> rather than using the gni provider in ofi mtl you want to try sockets. Then
>>> do
>>>
>>> Export OMPI_MCA_mtl_ofi_provider_include=sockets
>>>
>>>
>> Thanks.  I'm glad that there is an option to set them this way.
>>
>>
>>> In the spirit OMPI - may the force be with you.
>>>
>>>
>> All I will say here is that Open-MPI has a Vader BTL :-)
>>
>>>
>>> > On Thu 19.11.2015 09:44:20 Jeff Hammond wrote:
>>> > > I have no idea what this is trying to tell me. Help?
>>> > >
>>> > > jhammond@nid00024:~/MPI/qoit/collectives> mpirun -n 2 ./driver.x 64
>>> > > [nid00024:00482] [[46168,0],0] ORTE_ERROR_LOG: Not found in file
>>> > > ../../../../../orte/mca/plm/alps/plm_alps_module.c at line 418
>>> > >
>>> > > I can run the same job with srun without incident:
>>> > >
>>> > > jhammond@nid00024:~/MPI/qoit/collectives> srun -n 2 ./driver.x 64
>>> > > MPI was initialized.
>>> > >
>>> > > This is on the NERSC Cori Cray XC40 system. I build Open-MPI git
>>> head from
>>> > > source for OFI libfabric.
>>> > >
>>> > > I have many other issues, which I will report later. As a spoiler,
>>> if I
>>> > > cannot use your mpirun, I cannot set any of the MCA options there. Is
>>> > > there a method to set MCA options with environment variables? I
>>> could not
>>> > > find this documented anywhere.
>>> > >
>>> > > In particular, is there a way to cause shm to not use the global
>>> > > filesystem? I see this issue comes up a lot and I read the list
>>> archives,
>>> > > but the warning message (
>>> > >
>>> 

Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-19 Thread Howard Pritchard
Hi Jeff,

I finally got an allocation on cori - it's one busy machine.

Anyway, using the ompi I'd built on edison with the above recommended
configure options
I was able to run using either srun or mpirun on cori provided that in the
latter case I used

mpirun -np X -N Y --mca plm slurm ./my_favorite_app

I will make an adjustment to the alps plm launcher to disqualify itself if
the wlm_detect
facility on the cray reports that srun is the launcher.  That's a minor fix
and should make
it into v2.x in a week or so.  It will be a runtime selection so you only
have to build ompi
once for use either on edison or cori.

Howard


2015-11-19 17:11 GMT-07:00 Howard Pritchard :

> Hi Jeff H.
>
> Why don't you just try configuring with
>
> ./configure --prefix=my_favorite_install_dir
> --with-libfabric=install_dir_for_libfabric
> make -j 8 install
>
> and see what happens?
>
> Make sure before you configure that you have PrgEnv-gnu or PrgEnv-intel
> module loaded.
>
> Those were the configure/compiler options I used to do testing of ofi mtl
> on cori.
>
> Jeff S. - this thread has gotten intermingled with mpich setup as well,
> hence
> the suggestion for the mpich shm mechanism.
>
>
> Howard
>
>
>
> 2015-11-19 16:59 GMT-07:00 Jeff Hammond :
>
>>
>>> How did you configure for Cori?  You need to be using the slurm plm
>>> component for that system.  I know this sounds like gibberish.
>>>
>>>
>> ../configure --with-libfabric=$HOME/OFI/install-ofi-gcc-gni-cori \
>>  --enable-mca-static=mtl-ofi \
>>  --enable-mca-no-build=btl-openib,btl-vader,btl-ugni,btl-tcp \
>>  --enable-static --disable-shared --disable-dlopen \
>>  --prefix=$HOME/MPI/install-ompi-ofi-gcc-gni-xpmem-cori \
>>  --with-cray-pmi --with-alps --with-cray-xpmem --with-slurm \
>>  --without-verbs --without-fca --without-mxm --without-ucx \
>>  --without-portals4 --without-psm --without-psm2 \
>>  --without-udreg --without-ugni --without-munge \
>>  --without-sge --without-loadleveler --without-tm --without-lsf \
>>  --without-pvfs2 --without-plfs \
>>  --without-cuda --disable-oshmem \
>>  --disable-mpi-fortran --disable-oshmem-fortran \
>>  LDFLAGS="-L/opt/cray/ugni/default/lib64 -lugni \
>>   -L/opt/cray/alps/default/lib64 -lalps -lalpslli -lalpsutil \   
>>-ldl -lrt"
>>
>>
>> This is copied from
>> https://github.com/jeffhammond/HPCInfo/blob/master/ofi/README.md#open-mpi,
>> which I note in case you want to see what changes I've made at any point in
>> the future.
>>
>>
>>> There should be a with-slurm configure option to pick up this component.
>>>
>>> Indeed there is.
>>
>>
>>> Doesn't mpich have the option to use sysv memory?  You may want to try
>>> that
>>>
>>>
>> MPICH?  Look, I may have earned my way onto Santa's naughty list more
>> than a few times, but at least I have the decency not to post MPICH
>> questions to the Open-MPI list ;-)
>>
>> If there is a way to tell Open-MPI to use shm_open without filesystem
>> backing (if that is even possible) at configure time, I'd love to do that.
>>
>>
>>> Oh for tuning params you can use env variables.  For example lets say
>>> rather than using the gni provider in ofi mtl you want to try sockets. Then
>>> do
>>>
>>> Export OMPI_MCA_mtl_ofi_provider_include=sockets
>>>
>>>
>> Thanks.  I'm glad that there is an option to set them this way.
>>
>>
>>> In the spirit OMPI - may the force be with you.
>>>
>>>
>> All I will say here is that Open-MPI has a Vader BTL :-)
>>
>>>
>>> > On Thu 19.11.2015 09:44:20 Jeff Hammond wrote:
>>> > > I have no idea what this is trying to tell me. Help?
>>> > >
>>> > > jhammond@nid00024:~/MPI/qoit/collectives> mpirun -n 2 ./driver.x 64
>>> > > [nid00024:00482] [[46168,0],0] ORTE_ERROR_LOG: Not found in file
>>> > > ../../../../../orte/mca/plm/alps/plm_alps_module.c at line 418
>>> > >
>>> > > I can run the same job with srun without incident:
>>> > >
>>> > > jhammond@nid00024:~/MPI/qoit/collectives> srun -n 2 ./driver.x 64
>>> > > MPI was initialized.
>>> > >
>>> > > This is on the NERSC Cori Cray XC40 system. I build Open-MPI git
>>> head from
>>> > > source for OFI libfabric.
>>> > >
>>> > > I have many other issues, which I will report later. As a spoiler,
>>> if I
>>> > > cannot use your mpirun, I cannot set any of the MCA options there. Is
>>> > > there a method to set MCA options with environment variables? I
>>> could not
>>> > > find this documented anywhere.
>>> > >
>>> > > In particular, is there a way to cause shm to not use the global
>>> > > filesystem? I see this issue comes up a lot and I read the list
>>> archives,
>>> > > but the warning message (
>>> > >
>>> https://github.com/hpc/cce-mpi-openmpi-1.6.4/blob/master/ompi/mca/common/sm/
>>> > > help-mpi-common-sm.txt) suggested that I could override it by
>>> setting TMP,
>>> 

Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-19 Thread Jeff Hammond
>
>
> How did you configure for Cori?  You need to be using the slurm plm
> component for that system.  I know this sounds like gibberish.
>
>
../configure --with-libfabric=$HOME/OFI/install-ofi-gcc-gni-cori \
 --enable-mca-static=mtl-ofi \
 --enable-mca-no-build=btl-openib,btl-vader,btl-ugni,btl-tcp \
 --enable-static --disable-shared --disable-dlopen \
 --prefix=$HOME/MPI/install-ompi-ofi-gcc-gni-xpmem-cori \
 --with-cray-pmi --with-alps --with-cray-xpmem --with-slurm \
 --without-verbs --without-fca --without-mxm --without-ucx \
 --without-portals4 --without-psm --without-psm2 \
 --without-udreg --without-ugni --without-munge \
 --without-sge --without-loadleveler --without-tm --without-lsf \
 --without-pvfs2 --without-plfs \
 --without-cuda --disable-oshmem \
 --disable-mpi-fortran --disable-oshmem-fortran \
 LDFLAGS="-L/opt/cray/ugni/default/lib64 -lugni \
  -L/opt/cray/alps/default/lib64 -lalps -lalpslli -lalpsutil \
  -ldl -lrt"


This is copied from
https://github.com/jeffhammond/HPCInfo/blob/master/ofi/README.md#open-mpi,
which I note in case you want to see what changes I've made at any point in
the future.


> There should be a with-slurm configure option to pick up this component.
>
Indeed there is.


> Doesn't mpich have the option to use sysv memory?  You may want to try that
>
>
MPICH?  Look, I may have earned my way onto Santa's naughty list more than
a few times, but at least I have the decency not to post MPICH questions to
the Open-MPI list ;-)

If there is a way to tell Open-MPI to use shm_open without filesystem
backing (if that is even possible) at configure time, I'd love to do that.


> Oh for tuning params you can use env variables.  For example lets say
> rather than using the gni provider in ofi mtl you want to try sockets. Then
> do
>
> Export OMPI_MCA_mtl_ofi_provider_include=sockets
>
>
Thanks.  I'm glad that there is an option to set them this way.


> In the spirit OMPI - may the force be with you.
>
>
All I will say here is that Open-MPI has a Vader BTL :-)

>
> > On Thu 19.11.2015 09:44:20 Jeff Hammond wrote:
> > > I have no idea what this is trying to tell me. Help?
> > >
> > > jhammond@nid00024:~/MPI/qoit/collectives> mpirun -n 2 ./driver.x 64
> > > [nid00024:00482] [[46168,0],0] ORTE_ERROR_LOG: Not found in file
> > > ../../../../../orte/mca/plm/alps/plm_alps_module.c at line 418
> > >
> > > I can run the same job with srun without incident:
> > >
> > > jhammond@nid00024:~/MPI/qoit/collectives> srun -n 2 ./driver.x 64
> > > MPI was initialized.
> > >
> > > This is on the NERSC Cori Cray XC40 system. I build Open-MPI git head
> from
> > > source for OFI libfabric.
> > >
> > > I have many other issues, which I will report later. As a spoiler, if I
> > > cannot use your mpirun, I cannot set any of the MCA options there. Is
> > > there a method to set MCA options with environment variables? I could
> not
> > > find this documented anywhere.
> > >
> > > In particular, is there a way to cause shm to not use the global
> > > filesystem? I see this issue comes up a lot and I read the list
> archives,
> > > but the warning message (
> > >
> https://github.com/hpc/cce-mpi-openmpi-1.6.4/blob/master/ompi/mca/common/sm/
> > > help-mpi-common-sm.txt) suggested that I could override it by setting
> TMP,
> > > TEMP or TEMPDIR, which I did to no avail.
> >
> > From my experience on edison: the one environment variable that does
> works is TMPDIR - the one that is not listed in the error message :-)
>

That's great.  I will try that now.  Is there a Github issue open already
to fix that documentation?  If not...


> > Can't help you with your mpirun problem though ...
>
No worries.  I appreciate all the help I can get.

Thanks,

Jeff

-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/


Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-19 Thread Howard
Hi Jeff

How did you configure for Cori?  You need to be using the slurm plm component 
for that system.  I know this sounds like gibberish.  

There should be a with-slurm configure option to pick up this component. 

Doesn't mpich have the option to use sysv memory?  You may want to try that

Oh for tuning params you can use env variables.  For example let's say rather 
than using the gni provider in ofi mtl you want to try sockets. Then do

export OMPI_MCA_mtl_ofi_provider_include=sockets
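
More generally, any MCA parameter can be set from the environment by
prefixing its name with OMPI_MCA_, so the two invocations below should be
equivalent (./a.out just stands in for your application):

export OMPI_MCA_mtl_ofi_provider_include=sockets
mpirun -np 2 ./a.out

mpirun -np 2 --mca mtl_ofi_provider_include sockets ./a.out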

In the spirit of OMPI - may the force be with you.

Howard 

Sent from my iPhone

> On 19.11.2015 at 11:51, Martin Siegert wrote:
> 
> Hi Jeff,
>  
> On Thu 19.11.2015 09:44:20 Jeff Hammond wrote:
> > I have no idea what this is trying to tell me. Help?
> >
> > jhammond@nid00024:~/MPI/qoit/collectives> mpirun -n 2 ./driver.x 64
> > [nid00024:00482] [[46168,0],0] ORTE_ERROR_LOG: Not found in file
> > ../../../../../orte/mca/plm/alps/plm_alps_module.c at line 418
> >
> > I can run the same job with srun without incident:
> >
> > jhammond@nid00024:~/MPI/qoit/collectives> srun -n 2 ./driver.x 64
> > MPI was initialized.
> >
> > This is on the NERSC Cori Cray XC40 system. I build Open-MPI git head from
> > source for OFI libfabric.
> >
> > I have many other issues, which I will report later. As a spoiler, if I
> > cannot use your mpirun, I cannot set any of the MCA options there. Is
> > there a method to set MCA options with environment variables? I could not
> > find this documented anywhere.
> >
> > In particular, is there a way to cause shm to not use the global
> > filesystem? I see this issue comes up a lot and I read the list archives,
> > but the warning message (
> > https://github.com/hpc/cce-mpi-openmpi-1.6.4/blob/master/ompi/mca/common/sm/
> > help-mpi-common-sm.txt) suggested that I could override it by setting TMP,
> > TEMP or TEMPDIR, which I did to no avail.
>  
> From my experience on edison: the one environment variable that does works is 
> TMPDIR - the one that is not listed in the error message :-)
>  
> Can't help you with your mpirun problem though ...
>  
> Cheers,
> Martin
>  
> --
> Martin Siegert
> Head, Research Computing
> WestGrid/ComputeCanada Site Lead
> Simon Fraser University
> Burnaby, British Columbia
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/11/28063.php


Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-19 Thread Martin Siegert
Hi Jeff,

On Thu 19.11.2015 09:44:20 Jeff Hammond wrote:
> I have no idea what this is trying to tell me.  Help?
> 
> jhammond@nid00024:~/MPI/qoit/collectives> mpirun -n 2 ./driver.x 64
> [nid00024:00482] [[46168,0],0] ORTE_ERROR_LOG: Not found in file
> ../../../../../orte/mca/plm/alps/plm_alps_module.c at line 418
> 
> I can run the same job with srun without incident:
> 
> jhammond@nid00024:~/MPI/qoit/collectives> srun -n 2 ./driver.x 64
> MPI was initialized.
> 
> This is on the NERSC Cori Cray XC40 system.  I build Open-MPI git head 
from
> source for OFI libfabric.
> 
> I have many other issues, which I will report later.  As a spoiler, if I
> cannot use your mpirun, I cannot set any of the MCA options there.  Is
> there a method to set MCA options with environment variables?  I could 
not
> find this documented anywhere.
> 
> In particular, is there a way to cause shm to not use the global
> filesystem?  I see this issue comes up a lot and I read the list archives,
> but the warning message (
> https://github.com/hpc/cce-mpi-openmpi-1.6.4/blob/master/ompi/mca/common/sm/
> help-mpi-common-sm.txt) suggested that I could override it by setting 
TMP,
> TEMP or TEMPDIR, which I did to no avail.

From my experience on edison: the one environment variable that does 
work is TMPDIR - the one that is not listed in the error message :-)
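
For example, pointing it at node-local scratch before launching (the path
below is only an illustration - use whatever local directory your nodes
actually provide):

export TMPDIR=/tmp
mpirun -n 2 ./driver.x 64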

Can't help you with your mpirun problem though ...

Cheers,
Martin

-- 
Martin Siegert
Head, Research Computing
WestGrid/ComputeCanada Site Lead
Simon Fraser University
Burnaby, British Columbia


Re: [OMPI users] Help with Specific Binding

2015-09-13 Thread Ralph Castain
Check out the man page “OMPI_Affinity_str” for an MPI extension that might help


> On Sep 13, 2015, at 7:28 AM, Saliya Ekanayake  wrote:
> 
> Thank you, I'll try this. Also, is there a way to know which core a process 
> is bound to within the program other than executing something like taskset 
>  from program?
> 
> On Sun, Sep 13, 2015 at 10:05 AM, Ralph Castain  > wrote:
> Actually, the error was correct - it was me that was incorrect. The correct 
> set of options would be:
> 
> —map-by ppr:12_node —bind-to core —cpu-set=0,2,4,…
> 
> Sorry about the confusion
> 
> 
>> On Sep 13, 2015, at 2:43 AM, Ralph Castain > > wrote:
>> 
>> The rankfile will certainly do it, but that error is a bug and I’ll have to 
>> fix it.
>> 
>> 
>>> On Sep 13, 2015, at 1:10 AM, Saliya Ekanayake >> > wrote:
>>> 
>>> I could get it working by manually generating a rankfile all the ranks and 
>>> not using any --map-by options.
>>> 
>>> I'll try the --map-by core as well
>>> 
>>> On Sun, Sep 13, 2015 at 3:59 AM, Tobias Kloeffel >> > wrote:
>>> Hi,
>>> use: --map-by core
>>> 
>>> regards,
>>> Tobias
>>> 
>>> 
>>> On 09/13/2015 09:41 AM, Saliya Ekanayake wrote:
 I tried,
 
  --map-by ppr:12:node --slot-list 0,2,4,6,8,10,12,14,16,18,20,22 --bind-to 
 core -np 12
 
 but it complains,
 
 "Conflicting directives for binding policy are causing the policy
 to be redefined:
 
   New policy:   socket
   Prior policy:  CORE
 
 Please check that only one policy is defined.
 "
 
 On Sun, Sep 13, 2015 at 2:57 AM, Ralph Castain > wrote:
 Try something like this instead:
 
 —map-by ppr:12:node —bind-to core —slot-list=0,2,4,6,8,…
 
 You’ll have to play a bit with the core numbers in the slot-list to get 
 the numbering right as I don’t know how your machine numbers them, and I 
 can’t guarantee it will work - but it’s worth a shot. If it doesn’t, then 
 I may have to add an option for such purposes
 
 Ralph
 
> On Sep 12, 2015, at 7:39 PM, Saliya Ekanayake  > wrote:
> 
> Hi,
> 
> We've a machine as in the following picture. I'd like to run 12 MPI procs 
> per node each bound to 1 core, but like shown in blue dots in the 
> pictures. I can use the following command to run 12 procs per node, but 
> PE=1 makes all the 12 processes will run in just 1 socket. PE=2 will make 
> a process bind to 2 cores, which is not what I want. 
> 
> --map-by ppr:12:node:PE=1,SPAN
> 
> Thank you,
> Saliya
> 
> 
> 
> -- 
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914 
>  http://saliya.org 
> ___
> users mailing list
> us...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/09/27558.php 
> 
 
 ___
 users mailing list
 us...@open-mpi.org 
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
 
 Link to this post: 
 http://www.open-mpi.org/community/lists/users/2015/09/27559.php 
 
 
 
 
 -- 
 Saliya Ekanayake
 Ph.D. Candidate | Research Assistant
 School of Informatics and Computing | Digital Science Center
 Indiana University, Bloomington
 Cell 812-391-4914 
 http://saliya.org 
 
 ___
 users mailing list
 us...@open-mpi.org 
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
 
 Link to this post: 
 http://www.open-mpi.org/community/lists/users/2015/09/27560.php 
 
>>> -- 
>>> M.Sc. Tobias Klöffel
>>> ===
>>> Interdisciplinary Center for Molecular Materials (ICMM)
>>> and Computer-Chemistry-Center (CCC)
>>> Department Chemie und Pharmazie
>>> Friedrich-Alexander-Universität 

Re: [OMPI users] Help with Specific Binding

2015-09-13 Thread Gilles Gouaillardet
on linux, you can look at /proc/self/status and search for Cpus_allowed_list
or you can use the sched_getaffinity system call

note that in some (hopefully rare) cases, this will return different results
than hwloc
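
A minimal sketch of the sched_getaffinity route (Linux/glibc only, not from
this thread - just an illustration, with error handling trimmed):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    /* pid 0 means "the calling process" */
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        for (int c = 0; c < CPU_SETSIZE; c++) {
            if (CPU_ISSET(c, &mask))
                printf("pid %d may run on cpu %d\n", (int)getpid(), c);
        }
    }
    return 0;
}

The same mask is what shows up in the Cpus_allowed_list field of
/proc/<pid>/status.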

On Sunday, September 13, 2015, Saliya Ekanayake  wrote:

> Thank you, I'll try this. Also, is there a way to know which core a
> process is bound to within the program other than executing something like
> taskset  from program?
>
> On Sun, Sep 13, 2015 at 10:05 AM, Ralph Castain  > wrote:
>
>> Actually, the error was correct - it was me that was incorrect. The
>> correct set of options would be:
>>
>> —map-by ppr:12_node —bind-to core —cpu-set=0,2,4,…
>>
>> Sorry about the confusion
>>
>>
>> On Sep 13, 2015, at 2:43 AM, Ralph Castain > > wrote:
>>
>> The rankfile will certainly do it, but that error is a bug and I’ll have
>> to fix it.
>>
>>
>> On Sep 13, 2015, at 1:10 AM, Saliya Ekanayake > > wrote:
>>
>> I could get it working by manually generating a rankfile all the ranks
>> and not using any --map-by options.
>>
>> I'll try the --map-by core as well
>>
>> On Sun, Sep 13, 2015 at 3:59 AM, Tobias Kloeffel > > wrote:
>>
>>> Hi,
>>> use: --map-by core
>>>
>>> regards,
>>> Tobias
>>>
>>>
>>> On 09/13/2015 09:41 AM, Saliya Ekanayake wrote:
>>>
>>> I tried,
>>>
>>>  --map-by ppr:12:node --slot-list 0,2,4,6,8,10,12,14,16,18,20,22
>>> --bind-to core -np 12
>>>
>>> but it complains,
>>>
>>> "Conflicting directives for binding policy are causing the policy
>>> to be redefined:
>>>
>>>   New policy:   socket
>>>   Prior policy:  CORE
>>>
>>> Please check that only one policy is defined.
>>> "
>>>
>>> On Sun, Sep 13, 2015 at 2:57 AM, Ralph Castain >> > wrote:
>>>
 Try something like this instead:

 —map-by ppr:12:node —bind-to core —slot-list=0,2,4,6,8,…

 You’ll have to play a bit with the core numbers in the slot-list to get
 the numbering right as I don’t know how your machine numbers them, and I
 can’t guarantee it will work - but it’s worth a shot. If it doesn’t, then I
 may have to add an option for such purposes

 Ralph

 On Sep 12, 2015, at 7:39 PM, Saliya Ekanayake > wrote:

 Hi,

 We've a machine as in the following picture. I'd like to run 12 MPI
 procs per node each bound to 1 core, but like shown in blue dots in the
 pictures. I can use the following command to run 12 procs per node, but
 PE=1 makes all the 12 processes will run in just 1 socket. PE=2 will make a
 process bind to 2 cores, which is not what I want.

 --map-by ppr:12:node:PE=1,SPAN

 Thank you,
 Saliya

 

 --
 Saliya Ekanayake
 Ph.D. Candidate | Research Assistant
 School of Informatics and Computing | Digital Science Center
 Indiana University, Bloomington
 Cell 812-391-4914
 http://saliya.org
 ___
 users mailing list
 us...@open-mpi.org 
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
 Link to this post:
 http://www.open-mpi.org/community/lists/users/2015/09/27558.php



 ___
 users mailing list
 us...@open-mpi.org 
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
 Link to this post:
 http://www.open-mpi.org/community/lists/users/2015/09/27559.php

>>>
>>>
>>>
>>> --
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>> Cell 812-391-4914
>>> http://saliya.org
>>>
>>>
>>> ___
>>> users mailing listus...@open-mpi.org 
>>> 
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2015/09/27560.php
>>>
>>>
>>> --
>>> M.Sc. Tobias Klöffel
>>> ===
>>> Interdisciplinary Center for Molecular Materials (ICMM)
>>> and Computer-Chemistry-Center (CCC)
>>> Department Chemie und Pharmazie
>>> Friedrich-Alexander-Universität Erlangen-Nürnberg
>>> Nägelsbachstr. 25
>>> D-91052 Erlangen, Germany
>>>
>>> Room: 2.307
>>> Phone: +49 (0) 9131 / 85 - 20421
>>> Fax: +49 (0) 9131 / 85 - 26565
>>>
>>> 

Re: [OMPI users] Help with Specific Binding

2015-09-13 Thread Saliya Ekanayake
Thank you, I'll try this. Also, is there a way to know which core a process
is bound to within the program other than executing something like taskset
 from program?

On Sun, Sep 13, 2015 at 10:05 AM, Ralph Castain  wrote:

> Actually, the error was correct - it was me that was incorrect. The
> correct set of options would be:
>
> —map-by ppr:12_node —bind-to core —cpu-set=0,2,4,…
>
> Sorry about the confusion
>
>
> On Sep 13, 2015, at 2:43 AM, Ralph Castain  wrote:
>
> The rankfile will certainly do it, but that error is a bug and I’ll have
> to fix it.
>
>
> On Sep 13, 2015, at 1:10 AM, Saliya Ekanayake  wrote:
>
> I could get it working by manually generating a rankfile all the ranks and
> not using any --map-by options.
>
> I'll try the --map-by core as well
>
> On Sun, Sep 13, 2015 at 3:59 AM, Tobias Kloeffel 
> wrote:
>
>> Hi,
>> use: --map-by core
>>
>> regards,
>> Tobias
>>
>>
>> On 09/13/2015 09:41 AM, Saliya Ekanayake wrote:
>>
>> I tried,
>>
>>  --map-by ppr:12:node --slot-list 0,2,4,6,8,10,12,14,16,18,20,22
>> --bind-to core -np 12
>>
>> but it complains,
>>
>> "Conflicting directives for binding policy are causing the policy
>> to be redefined:
>>
>>   New policy:   socket
>>   Prior policy:  CORE
>>
>> Please check that only one policy is defined.
>> "
>>
>> On Sun, Sep 13, 2015 at 2:57 AM, Ralph Castain  wrote:
>>
>>> Try something like this instead:
>>>
>>> —map-by ppr:12:node —bind-to core —slot-list=0,2,4,6,8,…
>>>
>>> You’ll have to play a bit with the core numbers in the slot-list to get
>>> the numbering right as I don’t know how your machine numbers them, and I
>>> can’t guarantee it will work - but it’s worth a shot. If it doesn’t, then I
>>> may have to add an option for such purposes
>>>
>>> Ralph
>>>
>>> On Sep 12, 2015, at 7:39 PM, Saliya Ekanayake  wrote:
>>>
>>> Hi,
>>>
>>> We've a machine as in the following picture. I'd like to run 12 MPI
>>> procs per node each bound to 1 core, but like shown in blue dots in the
>>> pictures. I can use the following command to run 12 procs per node, but
>>> PE=1 makes all the 12 processes will run in just 1 socket. PE=2 will make a
>>> process bind to 2 cores, which is not what I want.
>>>
>>> --map-by ppr:12:node:PE=1,SPAN
>>>
>>> Thank you,
>>> Saliya
>>>
>>> 
>>>
>>> --
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>> Cell 812-391-4914
>>> http://saliya.org
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2015/09/27558.php
>>>
>>>
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2015/09/27559.php
>>>
>>
>>
>>
>> --
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> Cell 812-391-4914
>> http://saliya.org
>>
>>
>> ___
>> users mailing listus...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2015/09/27560.php
>>
>>
>> --
>> M.Sc. Tobias Klöffel
>> ===
>> Interdisciplinary Center for Molecular Materials (ICMM)
>> and Computer-Chemistry-Center (CCC)
>> Department Chemie und Pharmazie
>> Friedrich-Alexander-Universität Erlangen-Nürnberg
>> Nägelsbachstr. 25
>> D-91052 Erlangen, Germany
>>
>> Room: 2.307
>> Phone: +49 (0) 9131 / 85 - 20421
>> Fax: +49 (0) 9131 / 85 - 26565
>>
>> ===
>>
>>
>> E-mail: tobias.kloef...@fau.de
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/09/27561.php
>>
>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/09/27562.php
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: 

Re: [OMPI users] Help with Specific Binding

2015-09-13 Thread Ralph Castain
Actually, the error was correct - it was me that was incorrect. The correct set 
of options would be:

--map-by ppr:12:node --bind-to core --cpu-set=0,2,4,…
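
Spelled out for the 12-per-node case in this thread (untested sketch - the
core IDs must match your machine's numbering, ./a.out stands in for your
binary, and --report-bindings is only there so you can verify the result):

mpirun --map-by ppr:12:node --bind-to core \
   --cpu-set 0,2,4,6,8,10,12,14,16,18,20,22 \
   --report-bindings ./a.out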

Sorry about the confusion


> On Sep 13, 2015, at 2:43 AM, Ralph Castain  wrote:
> 
> The rankfile will certainly do it, but that error is a bug and I’ll have to 
> fix it.
> 
> 
>> On Sep 13, 2015, at 1:10 AM, Saliya Ekanayake > > wrote:
>> 
>> I could get it working by manually generating a rankfile all the ranks and 
>> not using any --map-by options.
>> 
>> I'll try the --map-by core as well
>> 
>> On Sun, Sep 13, 2015 at 3:59 AM, Tobias Kloeffel > > wrote:
>> Hi,
>> use: --map-by core
>> 
>> regards,
>> Tobias
>> 
>> 
>> On 09/13/2015 09:41 AM, Saliya Ekanayake wrote:
>>> I tried,
>>> 
>>>  --map-by ppr:12:node --slot-list 0,2,4,6,8,10,12,14,16,18,20,22 --bind-to 
>>> core -np 12
>>> 
>>> but it complains,
>>> 
>>> "Conflicting directives for binding policy are causing the policy
>>> to be redefined:
>>> 
>>>   New policy:   socket
>>>   Prior policy:  CORE
>>> 
>>> Please check that only one policy is defined.
>>> "
>>> 
>>> On Sun, Sep 13, 2015 at 2:57 AM, Ralph Castain >> > wrote:
>>> Try something like this instead:
>>> 
>>> —map-by ppr:12:node —bind-to core —slot-list=0,2,4,6,8,…
>>> 
>>> You’ll have to play a bit with the core numbers in the slot-list to get the 
>>> numbering right as I don’t know how your machine numbers them, and I can’t 
>>> guarantee it will work - but it’s worth a shot. If it doesn’t, then I may 
>>> have to add an option for such purposes
>>> 
>>> Ralph
>>> 
 On Sep 12, 2015, at 7:39 PM, Saliya Ekanayake > wrote:
 
 Hi,
 
 We've a machine as in the following picture. I'd like to run 12 MPI procs 
 per node each bound to 1 core, but like shown in blue dots in the 
 pictures. I can use the following command to run 12 procs per node, but 
 PE=1 makes all the 12 processes will run in just 1 socket. PE=2 will make 
 a process bind to 2 cores, which is not what I want. 
 
 --map-by ppr:12:node:PE=1,SPAN
 
 Thank you,
 Saliya
 
 
 
 -- 
 Saliya Ekanayake
 Ph.D. Candidate | Research Assistant
 School of Informatics and Computing | Digital Science Center
 Indiana University, Bloomington
 Cell 812-391-4914 
  http://saliya.org 
 ___
 users mailing list
 us...@open-mpi.org 
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
 
 Link to this post: 
 http://www.open-mpi.org/community/lists/users/2015/09/27558.php 
 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org 
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
>>> 
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2015/09/27559.php 
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>> Cell 812-391-4914 
>>> http://saliya.org 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org 
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
>>> 
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2015/09/27560.php 
>>> 
>> -- 
>> M.Sc. Tobias Klöffel
>> ===
>> Interdisciplinary Center for Molecular Materials (ICMM)
>> and Computer-Chemistry-Center (CCC)
>> Department Chemie und Pharmazie
>> Friedrich-Alexander-Universität Erlangen-Nürnberg
>> Nägelsbachstr. 25
>> D-91052 Erlangen, Germany
>> 
>> Room: 2.307
>> Phone: +49 (0) 9131 / 85 - 20421 
>> 
>> Fax: +49 (0) 9131 / 85 - 26565 
>> 
>> 
>> ===
>> 
>> 
>> E-mail: tobias.kloef...@fau.de 
>> ___
>> users mailing list
>> us...@open-mpi.org 
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
>> 

Re: [OMPI users] Help with Specific Binding

2015-09-13 Thread Ralph Castain
The rankfile will certainly do it, but that error is a bug and I’ll have to fix 
it.


> On Sep 13, 2015, at 1:10 AM, Saliya Ekanayake  wrote:
> 
> I could get it working by manually generating a rankfile all the ranks and 
> not using any --map-by options.
> 
> I'll try the --map-by core as well
> 
> On Sun, Sep 13, 2015 at 3:59 AM, Tobias Kloeffel  > wrote:
> Hi,
> use: --map-by core
> 
> regards,
> Tobias
> 
> 
> On 09/13/2015 09:41 AM, Saliya Ekanayake wrote:
>> I tried,
>> 
>>  --map-by ppr:12:node --slot-list 0,2,4,6,8,10,12,14,16,18,20,22 --bind-to 
>> core -np 12
>> 
>> but it complains,
>> 
>> "Conflicting directives for binding policy are causing the policy
>> to be redefined:
>> 
>>   New policy:   socket
>>   Prior policy:  CORE
>> 
>> Please check that only one policy is defined.
>> "
>> 
>> On Sun, Sep 13, 2015 at 2:57 AM, Ralph Castain > > wrote:
>> Try something like this instead:
>> 
>> —map-by ppr:12:node —bind-to core —slot-list=0,2,4,6,8,…
>> 
>> You’ll have to play a bit with the core numbers in the slot-list to get the 
>> numbering right as I don’t know how your machine numbers them, and I can’t 
>> guarantee it will work - but it’s worth a shot. If it doesn’t, then I may 
>> have to add an option for such purposes
>> 
>> Ralph
>> 
>>> On Sep 12, 2015, at 7:39 PM, Saliya Ekanayake >> > wrote:
>>> 
>>> Hi,
>>> 
>>> We've a machine as in the following picture. I'd like to run 12 MPI procs 
>>> per node each bound to 1 core, but like shown in blue dots in the pictures. 
>>> I can use the following command to run 12 procs per node, but PE=1 makes 
>>> all the 12 processes will run in just 1 socket. PE=2 will make a process 
>>> bind to 2 cores, which is not what I want. 
>>> 
>>> --map-by ppr:12:node:PE=1,SPAN
>>> 
>>> Thank you,
>>> Saliya
>>> 
>>> 
>>> 
>>> -- 
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>> Cell 812-391-4914 
>>>  http://saliya.org 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org 
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
>>> 
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2015/09/27558.php 
>>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org 
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
>> 
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2015/09/27559.php 
>> 
>> 
>> 
>> 
>> -- 
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> Cell 812-391-4914 
>> http://saliya.org 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org 
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
>> 
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2015/09/27560.php 
>> 
> -- 
> M.Sc. Tobias Klöffel
> ===
> Interdisciplinary Center for Molecular Materials (ICMM)
> and Computer-Chemistry-Center (CCC)
> Department Chemie und Pharmazie
> Friedrich-Alexander-Universität Erlangen-Nürnberg
> Nägelsbachstr. 25
> D-91052 Erlangen, Germany
> 
> Room: 2.307
> Phone: +49 (0) 9131 / 85 - 20421 
> 
> Fax: +49 (0) 9131 / 85 - 26565 
> 
> 
> ===
> 
> 
> E-mail: tobias.kloef...@fau.de 
> ___
> users mailing list
> us...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/09/27561.php 
> 
> 
> 
> 
> -- 
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org 
> 

Re: [OMPI users] Help with Specific Binding

2015-09-13 Thread Saliya Ekanayake
I could get it working by manually generating a rankfile for all the ranks and
not using any --map-by options.
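
For reference, a rankfile for this kind of layout looks roughly like the
following (the hostname and the socket:core pairs are made up here - adjust
them to the actual topology) and is passed to mpirun with --rankfile:

rank 0=node001 slot=0:0
rank 1=node001 slot=0:2
rank 2=node001 slot=1:0
rank 3=node001 slot=1:2

and so on for the remaining ranks, then:

mpirun -np 12 --rankfile myrankfile ./a.out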

I'll try the --map-by core as well

On Sun, Sep 13, 2015 at 3:59 AM, Tobias Kloeffel 
wrote:

> Hi,
> use: --map-by core
>
> regards,
> Tobias
>
>
> On 09/13/2015 09:41 AM, Saliya Ekanayake wrote:
>
> I tried,
>
>  --map-by ppr:12:node --slot-list 0,2,4,6,8,10,12,14,16,18,20,22 --bind-to
> core -np 12
>
> but it complains,
>
> "Conflicting directives for binding policy are causing the policy
> to be redefined:
>
>   New policy:   socket
>   Prior policy:  CORE
>
> Please check that only one policy is defined.
> "
>
> On Sun, Sep 13, 2015 at 2:57 AM, Ralph Castain  wrote:
>
>> Try something like this instead:
>>
>> —map-by ppr:12:node —bind-to core —slot-list=0,2,4,6,8,…
>>
>> You’ll have to play a bit with the core numbers in the slot-list to get
>> the numbering right as I don’t know how your machine numbers them, and I
>> can’t guarantee it will work - but it’s worth a shot. If it doesn’t, then I
>> may have to add an option for such purposes
>>
>> Ralph
>>
>> On Sep 12, 2015, at 7:39 PM, Saliya Ekanayake  wrote:
>>
>> Hi,
>>
>> We've a machine as in the following picture. I'd like to run 12 MPI procs
>> per node each bound to 1 core, but like shown in blue dots in the pictures.
>> I can use the following command to run 12 procs per node, but PE=1 makes
>> all the 12 processes will run in just 1 socket. PE=2 will make a process
>> bind to 2 cores, which is not what I want.
>>
>> --map-by ppr:12:node:PE=1,SPAN
>>
>> Thank you,
>> Saliya
>>
>> 
>>
>> --
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> Cell 812-391-4914
>> http://saliya.org
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/09/27558.php
>>
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/09/27559.php
>>
>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org
>
>
> ___
> users mailing listus...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/09/27560.php
>
>
> --
> M.Sc. Tobias Klöffel
> ===
> Interdisciplinary Center for Molecular Materials (ICMM)
> and Computer-Chemistry-Center (CCC)
> Department Chemie und Pharmazie
> Friedrich-Alexander-Universität Erlangen-Nürnberg
> Nägelsbachstr. 25
> D-91052 Erlangen, Germany
>
> Room: 2.307
> Phone: +49 (0) 9131 / 85 - 20421
> Fax: +49 (0) 9131 / 85 - 26565
>
> ===
>
>
> E-mail: tobias.kloef...@fau.de
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/09/27561.php
>



-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
Cell 812-391-4914
http://saliya.org


Re: [OMPI users] Help with Specific Binding

2015-09-13 Thread Tobias Kloeffel

Hi,
use: --map-by core

regards,
Tobias

On 09/13/2015 09:41 AM, Saliya Ekanayake wrote:

I tried,

 --map-by ppr:12:node --slot-list 0,2,4,6,8,10,12,14,16,18,20,22 
--bind-to core -np 12


but it complains,

"Conflicting directives for binding policy are causing the policy
to be redefined:

  New policy:   socket
  Prior policy:  CORE

Please check that only one policy is defined.
"

On Sun, Sep 13, 2015 at 2:57 AM, Ralph Castain > wrote:


Try something like this instead:

—map-by ppr:12:node —bind-to core —slot-list=0,2,4,6,8,…

You’ll have to play a bit with the core numbers in the slot-list
to get the numbering right as I don’t know how your machine
numbers them, and I can’t guarantee it will work - but it’s worth
a shot. If it doesn’t, then I may have to add an option for such
purposes

Ralph


On Sep 12, 2015, at 7:39 PM, Saliya Ekanayake > wrote:

Hi,

We've a machine as in the following picture. I'd like to run 12
MPI procs per node each bound to 1 core, but like shown in blue
dots in the pictures. I can use the following command to run 12
procs per node, but PE=1 makes all the 12 processes will run in
just 1 socket. PE=2 will make a process bind to 2 cores, which is
not what I want.

--map-by ppr:12:node:PE=1,SPAN

Thank you,
Saliya



-- 
Saliya Ekanayake

Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
Cell 812-391-4914 
http://saliya.org 
___
users mailing list
us...@open-mpi.org 
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/09/27558.php



___
users mailing list
us...@open-mpi.org 
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/09/27559.php




--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
Cell 812-391-4914
http://saliya.org


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/09/27560.php


--
M.Sc. Tobias Klöffel
===
Interdisciplinary Center for Molecular Materials (ICMM)
and Computer-Chemistry-Center (CCC)
Department Chemie und Pharmazie
Friedrich-Alexander-Universität Erlangen-Nürnberg
Nägelsbachstr. 25
D-91052 Erlangen, Germany

Room: 2.307
Phone: +49 (0) 9131 / 85 - 20421
Fax: +49 (0) 9131 / 85 - 26565

===


E-mail: tobias.kloef...@fau.de



Re: [OMPI users] Help with Specific Binding

2015-09-13 Thread Saliya Ekanayake
I tried,

 --map-by ppr:12:node --slot-list 0,2,4,6,8,10,12,14,16,18,20,22 --bind-to
core -np 12

but it complains,

"Conflicting directives for binding policy are causing the policy
to be redefined:

  New policy:   socket
  Prior policy:  CORE

Please check that only one policy is defined.
"

On Sun, Sep 13, 2015 at 2:57 AM, Ralph Castain  wrote:

> Try something like this instead:
>
> —map-by ppr:12:node —bind-to core —slot-list=0,2,4,6,8,…
>
> You’ll have to play a bit with the core numbers in the slot-list to get
> the numbering right as I don’t know how your machine numbers them, and I
> can’t guarantee it will work - but it’s worth a shot. If it doesn’t, then I
> may have to add an option for such purposes
>
> Ralph
>
> On Sep 12, 2015, at 7:39 PM, Saliya Ekanayake  wrote:
>
> Hi,
>
> We've a machine as in the following picture. I'd like to run 12 MPI procs
> per node each bound to 1 core, but like shown in blue dots in the pictures.
> I can use the following command to run 12 procs per node, but PE=1 makes
> all the 12 processes will run in just 1 socket. PE=2 will make a process
> bind to 2 cores, which is not what I want.
>
> --map-by ppr:12:node:PE=1,SPAN
>
> Thank you,
> Saliya
>
> 
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/09/27558.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/09/27559.php
>



-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
Cell 812-391-4914
http://saliya.org


Re: [OMPI users] Help with Specific Binding

2015-09-13 Thread Ralph Castain
Try something like this instead:

--map-by ppr:12:node --bind-to core --slot-list=0,2,4,6,8,…

You’ll have to play a bit with the core numbers in the slot-list to get the 
numbering right as I don’t know how your machine numbers them, and I can’t 
guarantee it will work - but it’s worth a shot. If it doesn’t, then I may have 
to add an option for such purposes

Ralph

> On Sep 12, 2015, at 7:39 PM, Saliya Ekanayake  wrote:
> 
> Hi,
> 
> We've a machine as in the following picture. I'd like to run 12 MPI procs per 
> node each bound to 1 core, but like shown in blue dots in the pictures. I can 
> use the following command to run 12 procs per node, but PE=1 makes all the 12 
> processes will run in just 1 socket. PE=2 will make a process bind to 2 
> cores, which is not what I want. 
> 
> --map-by ppr:12:node:PE=1,SPAN
> 
> Thank you,
> Saliya
> 
> 
> 
> -- 
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/09/27558.php



Re: [OMPI users] Help : Slowness with OpenMPI (1.8.1) and Numpy

2015-06-12 Thread Ralph Castain
Is this a threaded code? If so, you should add --bind-to none to your 1.8 series 
command line
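
The 1.8 series binds processes by default (to a single core when only one or
two ranks are launched), so a multi-threaded BLAS underneath numpy ends up
with all of its threads pinned to one core - hence the slowdown. Something
along these lines should recover the 1.5.4 behaviour (your command from
below, just with the extra flag):

time /usr/lib64/openmpi/bin/mpirun -np 1 --bind-to none \
    python -c 'import numpy; numpy.linalg.svd(numpy.eye(1000))'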


> On Jun 12, 2015, at 7:58 AM, kishor sharma  wrote:
> 
> Hi There,
> 
> 
> 
> I am facing slowness running numpy code using mpirun with openmpi 1.8.1 
> version.
> 
> 
> 
> With Open MPI (1.8.1)
> 
> -
> 
> > /usr/lib64/openmpi/bin/mpirun -version
> 
> mpirun (Open MPI) 1.8.1
> 
>  
> Report bugs to http://www.open-mpi.org/community/help/ 
> 
> >  time /usr/lib64/openmpi/bin/mpirun -np 1 python -c 'import numpy; 
> > numpy.linalg.svd(numpy.eye(1000))'
> 
> real 23.75
> 
> user 6.95
> 
> sys 16.68
> 
> > 
> 
> 
> 
> 
> 
> With Open MPI (1.5.4):
> 
> -
> 
> > /usr/lib64/openmpi/bin/mpirun -version
> 
> mpirun (Open MPI) 1.5.4
> 
>  
> Report bugs to http://www.open-mpi.org/community/help/ 
> 
> > time /usr/lib64/openmpi/bin/mpirun -np 1 python -c 'import numpy; 
> > numpy.linalg.svd(numpy.eye(1000))'
> 
> real 1.35
> 
> user 2.11
> 
> sys 0.71
> 
> >
> 
> 
> 
> > Do you guys have any idea why the above function is 10-15x slower with openmpi 
> version 1.8.1
> 
> 
> 
> Thanks,
> 
> Kishor
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/06/27123.php



Re: [OMPI users] help in execution mpi

2015-04-23 Thread Ralph Castain
Use “orte_rsh_agent = rsh” instead
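
i.e. in the same MCA params file you already edited (the system-wide one is
<prefix>/etc/openmpi-mca-params.conf, the per-user one is
$HOME/.openmpi/mca-params.conf), simply put:

orte_rsh_agent = rsh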


> On Apr 23, 2015, at 10:48 AM, rebona...@upf.br wrote:
> 
> Hi all
> 
> I installed mpi (version 1.6.5) on ubuntu 14.04. I teach parallel 
> programming in an undergraduate course.
> I want to use rsh instead of ssh (the default).
> I changed the file "openmpi-mca-params.conf" and put plm_rsh_agent = rsh 
> there.
> The mpi application works, but a message appears for each process created:
> 
> /* begin message */
> --
> A deprecated MCA parameter value was specified in an MCA parameter
> file.  Deprecated MCA parameters should be avoided; they may disappear
> in future releases.
> 
>  Deprecated parameter: plm_rsh_agent
> --
> /* end message */
> 
> It's bad for explaining things to students. Is there any way to suppress these 
> warning messages?
> 
> Thank's a lot.
> 
> 
> Marcelo Trindade Rebonatto
> Passo Fundo University - Brazil
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/04/26773.php



Re: [OMPI users] Help on getting CMA works

2015-02-24 Thread Nathan Hjelm

I don't know the reasoning for requiring --with-cma to enable CMA but I
am looking at auto-detecting CMA instead of requiring Open MPI to be
configured with --with-cma. This will likely go into the 1.9 release
series and not 1.8.

-Nathan

On Thu, Feb 19, 2015 at 09:31:43PM -0500, Eric Chamberland wrote:
> Maybe it is a stupid question, but... why it is not tested and enabled by
> default at configure time since it is part of the kernel?
> 
> Eric
> 
> 
> On 02/19/2015 03:53 PM, Nathan Hjelm wrote:
> >Great! I will add an MCA variable to force CMA and also enable it if 1)
> >no yama and 2) no PR_SET_PTRACER.
> >
> >You might also look at using xpmem. You can find a version that supports
> >3.x @ https://github.com/hjelmn/xpmem . It is a kernel module +
> >userspace library that can be used by vader as a single-copy mechanism.
> >
> >In benchmarks it performs better than CMA but it may or may not perform
> >better with a real application.
> >
> >See:
> >
> >http://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy
> >
> >-Nathan
> >
> >On Thu, Feb 19, 2015 at 03:32:43PM -0500, Eric Chamberland wrote:
> >>On 02/19/2015 02:58 PM, Nathan Hjelm wrote:
> >>>On Thu, Feb 19, 2015 at 12:16:49PM -0500, Eric Chamberland wrote:
> On 02/19/2015 11:56 AM, Nathan Hjelm wrote:
> >If you have yama installed you can try:
> Nope, I do not have it installed... is it absolutely necessary? (and would
> it change something when it fails when I am root?)
> 
> Other question: In addition to "--with-cma" configure flag, do we have to
> pass any options to "mpicc" when compiling/linking an mpi application to 
> use
> cma?
> >>>No. CMA should work out of the box. You appear to have a setup I haven't
> >>>yet tested. It doesn't have yama nor does it have the PR_SET_PTRACER
> >>>prctl. Its quite possible there are no restriction on ptrace in this
> >>>setup. Can you try changing the following line at
> >>>opal/mca/btl/vader/btl_vader_component.c:370 from:
> >>>
> >>>bool cma_happy = false;
> >>>
> >>>to
> >>>
> >>>bool cma_happy = true;
> >>>
> >>ok! (as of the officiel release, this is line 386.)
> >>
> >>>and let me know if that works. If it does I will update vader to allow
> >>>CMA in this configuration.
> >>Yep!  It now works perfectly.  Testing with
> >>https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_bandwidth.c, on my
> >>own computer (dual Xeon), I have this:
> >>
> >>Without CMA:
> >>
> >>***Message size:  100 *** best  /  avg  / worst (MB/sec)
> >>task pair:0 -1:8363.52 / 7946.77 / 5391.14
> >>
> >>with CMA:
> >>task pair:0 -1:9137.92 / 8955.98 / 7489.83
> >>
> >>Great!
> >>
> >>Now I have to bench my real application... ;-)
> >>
> >>Thanks!
> >>
> >>Eric
> >>
> >>___
> >>users mailing list
> >>us...@open-mpi.org
> >>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>Link to this post: 
> >>http://www.open-mpi.org/community/lists/users/2015/02/26355.php
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/02/26362.php




Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Eric Chamberland
Maybe it is a stupid question, but... why is it not tested and enabled 
by default at configure time since it is part of the kernel?


Eric


On 02/19/2015 03:53 PM, Nathan Hjelm wrote:

Great! I will add an MCA variable to force CMA and also enable it if 1)
no yama and 2) no PR_SET_PTRACER.

You might also look at using xpmem. You can find a version that supports
3.x @ https://github.com/hjelmn/xpmem . It is a kernel module +
userspace library that can be used by vader as a single-copy mechanism.

In benchmarks it performs better than CMA but it may or may not perform
better with a real application.

See:

http://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy

-Nathan

On Thu, Feb 19, 2015 at 03:32:43PM -0500, Eric Chamberland wrote:

On 02/19/2015 02:58 PM, Nathan Hjelm wrote:

On Thu, Feb 19, 2015 at 12:16:49PM -0500, Eric Chamberland wrote:

On 02/19/2015 11:56 AM, Nathan Hjelm wrote:

If you have yama installed you can try:

Nope, I do not have it installed... is it absolutely necessary? (and would
it change something when it fails when I am root?)

Other question: In addition to "--with-cma" configure flag, do we have to
pass any options to "mpicc" when compiling/linking an mpi application to use
cma?

No. CMA should work out of the box. You appear to have a setup I haven't
yet tested. It doesn't have yama nor does it have the PR_SET_PTRACER
prctl. Its quite possible there are no restriction on ptrace in this
setup. Can you try changing the following line at
opal/mca/btl/vader/btl_vader_component.c:370 from:

bool cma_happy = false;

to

bool cma_happy = true;


ok! (as of the official release, this is line 386.)


and let me know if that works. If it does I will update vader to allow
CMA in this configuration.

Yep!  It now works perfectly.  Testing with
https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_bandwidth.c, on my
own computer (dual Xeon), I have this:

Without CMA:

***Message size:  100 *** best  /  avg  / worst (MB/sec)
task pair:0 -1:8363.52 / 7946.77 / 5391.14

with CMA:
task pair:0 -1:9137.92 / 8955.98 / 7489.83

Great!

Now I have to bench my real application... ;-)

Thanks!

Eric

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/02/26355.php




Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Nathan Hjelm

Aurélien, I should also point out your fix has already been applied to
the 1.8 branch and will be included in 1.8.5.

-Nathan

On Thu, Feb 19, 2015 at 02:57:38PM -0700, Nathan Hjelm wrote:
> 
> Hmm, wait. Yes. Your change went in after 1.8.4 and has the same
> effect. If yama ins't installed it is safe to assume that the ptrace
> scope is effectively 0. So, your patch does fix the issue.
> 
> -Nathan
> 
> On Thu, Feb 19, 2015 at 02:53:47PM -0700, Nathan Hjelm wrote:
> > 
> > I don't think that will fix this issue. In this case yama is not
> > installed and it appears PR_SET_PTRACER is not available. This forces
> > vader to assume that CMA can not be used when that isn't always the
> > case. I think it might be safe to assume that CMA is unrestricted here.
> > 
> > -Nathan
> > 
> > On Thu, Feb 19, 2015 at 04:35:00PM -0500, Aurélien Bouteiller wrote:
> > > Nathan, 
> > > 
> > > I think I already pushed a patch for this particular issue last month. I 
> > > do not know if it has been back ported to release yet. 
> > > 
> > > See 
> > > here:https://github.com/open-mpi/ompi/commit/ee3b0903164898750137d3b71a8f067e16521102
> > > 
> > > Aurelien 
> > > 
> > > --
> > >   ~~~ Aurélien Bouteiller, Ph.D. ~~~
> > >  ~ Research Scientist @ ICL ~
> > > The University of Tennessee, Innovative Computing Laboratory
> > > 1122 Volunteer Blvd, suite 309, Knoxville, TN 37996
> > > tel: +1 (865) 974-9375   fax: +1 (865) 974-8296
> > > https://icl.cs.utk.edu/~bouteill/
> > > 
> > > 
> > > 
> > > 
> > > > On 19 Feb. 2015, at 15:53, Nathan Hjelm wrote:
> > > > 
> > > > 
> > > > Great! I will add an MCA variable to force CMA and also enable it if 1)
> > > > no yama and 2) no PR_SET_PTRACER.
> > > > 
> > > > You might also look at using xpmem. You can find a version that supports
> > > > 3.x @ https://github.com/hjelmn/xpmem . It is a kernel module +
> > > > userspace library that can be used by vader as a single-copy mechanism.
> > > > 
> > > > In benchmarks it performs better than CMA but it may or may not perform
> > > > better with a real application.
> > > > 
> > > > See:
> > > > 
> > > > http://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy
> > > > 
> > > > -Nathan
> > > > 
> > > > On Thu, Feb 19, 2015 at 03:32:43PM -0500, Eric Chamberland wrote:
> > > >> On 02/19/2015 02:58 PM, Nathan Hjelm wrote:
> > > >>> On Thu, Feb 19, 2015 at 12:16:49PM -0500, Eric Chamberland wrote:
> > >  
> > >  On 02/19/2015 11:56 AM, Nathan Hjelm wrote:
> > > > 
> > > > If you have yama installed you can try:
> > >  
> > >  Nope, I do not have it installed... is it absolutely necessary? (and 
> > >  would
> > >  it change something when it fails when I am root?)
> > >  
> > >  Other question: In addition to "--with-cma" configure flag, do we 
> > >  have to
> > >  pass any options to "mpicc" when compiling/linking an mpi 
> > >  application to use
> > >  cma?
> > > >>> 
> > > >>> No. CMA should work out of the box. You appear to have a setup I 
> > > >>> haven't
> > > >>> yet tested. It doesn't have yama nor does it have the PR_SET_PTRACER
> > > >>> prctl. Its quite possible there are no restriction on ptrace in this
> > > >>> setup. Can you try changing the following line at
> > > >>> opal/mca/btl/vader/btl_vader_component.c:370 from:
> > > >>> 
> > > >>> bool cma_happy = false;
> > > >>> 
> > > >>> to
> > > >>> 
> > > >>> bool cma_happy = true;
> > > >>> 
> > > >> 
> > > >> ok! (as of the officiel release, this is line 386.)
> > > >> 
> > > >>> and let me know if that works. If it does I will update vader to allow
> > > >>> CMA in this configuration.
> > > >> 
> > > >> Yep!  It now works perfectly.  Testing with
> > > >> https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_bandwidth.c, on 
> > > >> my
> > > >> own computer (dual Xeon), I have this:
> > > >> 
> > > >> Without CMA:
> > > >> 
> > > >> ***Message size:  100 *** best  /  avg  / worst (MB/sec)
> > > >>   task pair:0 -1:8363.52 / 7946.77 / 5391.14
> > > >> 
> > > >> with CMA:
> > > >>   task pair:0 -1:9137.92 / 8955.98 / 7489.83
> > > >> 
> > > >> Great!
> > > >> 
> > > >> Now I have to bench my real application... ;-)
> > > >> 
> > > >> Thanks!
> > > >> 
> > > >> Eric
> > > >> 
> > > >> ___
> > > >> users mailing list
> > > >> us...@open-mpi.org
> > > >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > >> Link to this post: 
> > > >> http://www.open-mpi.org/community/lists/users/2015/02/26355.php
> > > > ___
> > > > users mailing list
> > > > us...@open-mpi.org
> > > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > > Link to this post: 
> > > > http://www.open-mpi.org/community/lists/users/2015/02/26356.php
> > > 
> > > 

Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Nathan Hjelm

Hmm, wait. Yes. Your change went in after 1.8.4 and has the same
effect. If yama isn't installed, it is safe to assume that the ptrace
scope is effectively 0. So, your patch does fix the issue.

-Nathan
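
(For context, the check being discussed here boils down to something like
the C sketch below. This is illustrative only, not the actual vader source;
it simply treats a missing yama ptrace_scope file as "yama not loaded",
i.e. an effective scope of 0, in which case CMA can be assumed to be
unrestricted.)

#include <stdio.h>

/* Illustrative sketch only: if the yama sysctl file is absent, yama is
 * not loaded and the ptrace scope is effectively 0 (classic, unrestricted
 * ptrace), so a CMA-style single copy can be attempted. */
static int effective_ptrace_scope(void)
{
    int scope = 0;                     /* no yama -> unrestricted */
    FILE *f = fopen("/proc/sys/kernel/yama/ptrace_scope", "r");

    if (f != NULL) {
        if (fscanf(f, "%d", &scope) != 1)
            scope = 0;                 /* unreadable -> assume unrestricted */
        fclose(f);
    }
    return scope;
}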

On Thu, Feb 19, 2015 at 02:53:47PM -0700, Nathan Hjelm wrote:
> 
> I don't think that will fix this issue. In this case yama is not
> installed and it appears PR_SET_PTRACER is not available. This forces
> vader to assume that CMA can not be used when that isn't always the
> case. I think it might be safe to assume that CMA is unrestricted here.
> 
> -Nathan
> 
> On Thu, Feb 19, 2015 at 04:35:00PM -0500, Aurélien Bouteiller wrote:
> > Nathan, 
> > 
> > I think I already pushed a patch for this particular issue last month. I do 
> > not know if it has been back ported to release yet. 
> > 
> > See 
> > here:https://github.com/open-mpi/ompi/commit/ee3b0903164898750137d3b71a8f067e16521102
> > 
> > Aurelien 
> > 
> > --
> >   ~~~ Aurélien Bouteiller, Ph.D. ~~~
> >  ~ Research Scientist @ ICL ~
> > The University of Tennessee, Innovative Computing Laboratory
> > 1122 Volunteer Blvd, suite 309, Knoxville, TN 37996
> > tel: +1 (865) 974-9375   fax: +1 (865) 974-8296
> > https://icl.cs.utk.edu/~bouteill/
> > 
> > 
> > 
> > 
> > > On 19 Feb 2015, at 15:53, Nathan Hjelm wrote:
> > > 
> > > 
> > > Great! I will add an MCA variable to force CMA and also enable it if 1)
> > > no yama and 2) no PR_SET_PTRACER.
> > > 
> > > You might also look at using xpmem. You can find a version that supports
> > > 3.x @ https://github.com/hjelmn/xpmem . It is a kernel module +
> > > userspace library that can be used by vader as a single-copy mechanism.
> > > 
> > > In benchmarks it performs better than CMA but it may or may not perform
> > > better with a real application.
> > > 
> > > See:
> > > 
> > > http://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy
> > > 
> > > -Nathan
> > > 
> > > On Thu, Feb 19, 2015 at 03:32:43PM -0500, Eric Chamberland wrote:
> > >> On 02/19/2015 02:58 PM, Nathan Hjelm wrote:
> > >>> On Thu, Feb 19, 2015 at 12:16:49PM -0500, Eric Chamberland wrote:
> >  
> >  On 02/19/2015 11:56 AM, Nathan Hjelm wrote:
> > > 
> > > If you have yama installed you can try:
> >  
> >  Nope, I do not have it installed... is it absolutely necessary? (and 
> >  would
> >  it change something when it fails when I am root?)
> >  
> >  Other question: In addition to "--with-cma" configure flag, do we have 
> >  to
> >  pass any options to "mpicc" when compiling/linking an mpi application 
> >  to use
> >  cma?
> > >>> 
> > >>> No. CMA should work out of the box. You appear to have a setup I haven't
> > >>> yet tested. It doesn't have yama nor does it have the PR_SET_PTRACER
> > >>> prctl. It's quite possible there are no restrictions on ptrace in this
> > >>> setup. Can you try changing the following line at
> > >>> opal/mca/btl/vader/btl_vader_component.c:370 from:
> > >>> 
> > >>> bool cma_happy = false;
> > >>> 
> > >>> to
> > >>> 
> > >>> bool cma_happy = true;
> > >>> 
> > >> 
> > >> ok! (as of the official release, this is line 386.)
> > >> 
> > >>> and let me know if that works. If it does I will update vader to allow
> > >>> CMA in this configuration.
> > >> 
> > >> Yep!  It now works perfectly.  Testing with
> > >> https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_bandwidth.c, on my
> > >> own computer (dual Xeon), I have this:
> > >> 
> > >> Without CMA:
> > >> 
> > >> ***Message size:  100 *** best  /  avg  / worst (MB/sec)
> > >>   task pair:0 -1:8363.52 / 7946.77 / 5391.14
> > >> 
> > >> with CMA:
> > >>   task pair:0 -1:9137.92 / 8955.98 / 7489.83
> > >> 
> > >> Great!
> > >> 
> > >> Now I have to bench my real application... ;-)
> > >> 
> > >> Thanks!
> > >> 
> > >> Eric
> > >> 
> > >> ___
> > >> users mailing list
> > >> us...@open-mpi.org
> > >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >> Link to this post: 
> > >> http://www.open-mpi.org/community/lists/users/2015/02/26355.php
> > > ___
> > > users mailing list
> > > us...@open-mpi.org
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > Link to this post: 
> > > http://www.open-mpi.org/community/lists/users/2015/02/26356.php
> > 
> > ___
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/users/2015/02/26358.php



> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/02/26359.php



Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Nathan Hjelm

I don't think that will fix this issue. In this case yama is not
installed and it appears PR_SET_PTRACER is not available. This forces
vader to assume that CMA can not be used when that isn't always the
case. I think it might be safe to assume that CMA is unrestricted here.

-Nathan

On Thu, Feb 19, 2015 at 04:35:00PM -0500, Aurélien Bouteiller wrote:
> Nathan, 
> 
> I think I already pushed a patch for this particular issue last month. I do 
> not know if it has been back ported to release yet. 
> 
> See 
> here:https://github.com/open-mpi/ompi/commit/ee3b0903164898750137d3b71a8f067e16521102
> 
> Aurelien 
> 
> --
>   ~~~ Aurélien Bouteiller, Ph.D. ~~~
>  ~ Research Scientist @ ICL ~
> The University of Tennessee, Innovative Computing Laboratory
> 1122 Volunteer Blvd, suite 309, Knoxville, TN 37996
> tel: +1 (865) 974-9375   fax: +1 (865) 974-8296
> https://icl.cs.utk.edu/~bouteill/
> 
> 
> 
> 
> > On 19 Feb 2015, at 15:53, Nathan Hjelm wrote:
> > 
> > 
> > Great! I will add an MCA variable to force CMA and also enable it if 1)
> > no yama and 2) no PR_SET_PTRACER.
> > 
> > You might also look at using xpmem. You can find a version that supports
> > 3.x @ https://github.com/hjelmn/xpmem . It is a kernel module +
> > userspace library that can be used by vader as a single-copy mechanism.
> > 
> > In benchmarks it performs better than CMA but it may or may not perform
> > better with a real application.
> > 
> > See:
> > 
> > http://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy
> > 
> > -Nathan
> > 
> > On Thu, Feb 19, 2015 at 03:32:43PM -0500, Eric Chamberland wrote:
> >> On 02/19/2015 02:58 PM, Nathan Hjelm wrote:
> >>> On Thu, Feb 19, 2015 at 12:16:49PM -0500, Eric Chamberland wrote:
>  
>  On 02/19/2015 11:56 AM, Nathan Hjelm wrote:
> > 
> > If you have yama installed you can try:
>  
>  Nope, I do not have it installed... is it absolutely necessary? (and 
>  would
>  it change something when it fails when I am root?)
>  
>  Other question: In addition to "--with-cma" configure flag, do we have to
>  pass any options to "mpicc" when compiling/linking an mpi application to 
>  use
>  cma?
> >>> 
> >>> No. CMA should work out of the box. You appear to have a setup I haven't
> >>> yet tested. It doesn't have yama nor does it have the PR_SET_PTRACER
> >>> prctl. It's quite possible there are no restrictions on ptrace in this
> >>> setup. Can you try changing the following line at
> >>> opal/mca/btl/vader/btl_vader_component.c:370 from:
> >>> 
> >>> bool cma_happy = false;
> >>> 
> >>> to
> >>> 
> >>> bool cma_happy = true;
> >>> 
> >> 
> >> ok! (as of the official release, this is line 386.)
> >> 
> >>> and let me know if that works. If it does I will update vader to allow
> >>> CMA in this configuration.
> >> 
> >> Yep!  It now works perfectly.  Testing with
> >> https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_bandwidth.c, on my
> >> own computer (dual Xeon), I have this:
> >> 
> >> Without CMA:
> >> 
> >> ***Message size:  100 *** best  /  avg  / worst (MB/sec)
> >>   task pair:0 -1:8363.52 / 7946.77 / 5391.14
> >> 
> >> with CMA:
> >>   task pair:0 -1:9137.92 / 8955.98 / 7489.83
> >> 
> >> Great!
> >> 
> >> Now I have to bench my real application... ;-)
> >> 
> >> Thanks!
> >> 
> >> Eric
> >> 
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >> Link to this post: 
> >> http://www.open-mpi.org/community/lists/users/2015/02/26355.php
> > ___
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/users/2015/02/26356.php
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/02/26358.php




Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Aurélien Bouteiller
Nathan, 

I think I already pushed a patch for this particular issue last month. I do not 
know if it has been back ported to release yet. 

See 
here:https://github.com/open-mpi/ompi/commit/ee3b0903164898750137d3b71a8f067e16521102

Aurelien 

--
  ~~~ Aurélien Bouteiller, Ph.D. ~~~
 ~ Research Scientist @ ICL ~
The University of Tennessee, Innovative Computing Laboratory
1122 Volunteer Blvd, suite 309, Knoxville, TN 37996
tel: +1 (865) 974-9375   fax: +1 (865) 974-8296
https://icl.cs.utk.edu/~bouteill/




> On 19 Feb 2015, at 15:53, Nathan Hjelm wrote:
> 
> 
> Great! I will add an MCA variable to force CMA and also enable it if 1)
> no yama and 2) no PR_SET_PTRACER.
> 
> You might also look at using xpmem. You can find a version that supports
> 3.x @ https://github.com/hjelmn/xpmem . It is a kernel module +
> userspace library that can be used by vader as a single-copy mechanism.
> 
> In benchmarks it performs better than CMA but it may or may not perform
> better with a real application.
> 
> See:
> 
> http://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy
> 
> -Nathan
> 
> On Thu, Feb 19, 2015 at 03:32:43PM -0500, Eric Chamberland wrote:
>> On 02/19/2015 02:58 PM, Nathan Hjelm wrote:
>>> On Thu, Feb 19, 2015 at 12:16:49PM -0500, Eric Chamberland wrote:
 
 On 02/19/2015 11:56 AM, Nathan Hjelm wrote:
> 
> If you have yama installed you can try:
 
 Nope, I do not have it installed... is it absolutely necessary? (and would
 it change something when it fails when I am root?)
 
 Other question: In addition to "--with-cma" configure flag, do we have to
 pass any options to "mpicc" when compiling/linking an mpi application to 
 use
 cma?
>>> 
>>> No. CMA should work out of the box. You appear to have a setup I haven't
>>> yet tested. It doesn't have yama nor does it have the PR_SET_PTRACER
> >>> prctl. It's quite possible there are no restrictions on ptrace in this
>>> setup. Can you try changing the following line at
>>> opal/mca/btl/vader/btl_vader_component.c:370 from:
>>> 
>>> bool cma_happy = false;
>>> 
>>> to
>>> 
>>> bool cma_happy = true;
>>> 
>> 
> >> ok! (as of the official release, this is line 386.)
>> 
>>> and let me know if that works. If it does I will update vader to allow
>>> CMA in this configuration.
>> 
>> Yep!  It now works perfectly.  Testing with
>> https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_bandwidth.c, on my
>> own computer (dual Xeon), I have this:
>> 
>> Without CMA:
>> 
>> ***Message size:  100 *** best  /  avg  / worst (MB/sec)
>>   task pair:0 -1:8363.52 / 7946.77 / 5391.14
>> 
>> with CMA:
>>   task pair:0 -1:9137.92 / 8955.98 / 7489.83
>> 
>> Great!
>> 
>> Now I have to bench my real application... ;-)
>> 
>> Thanks!
>> 
>> Eric
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2015/02/26355.php
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/02/26356.php



Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Eric Chamberland

On 02/19/2015 03:53 PM, Nathan Hjelm wrote:


Great! I will add an MCA variable to force CMA and also enable it if 1)
no yama and 2) no PR_SET_PTRACER.


cool, thanks again!



You might also look at using xpmem. You can find a version that supports
3.x @ https://github.com/hjelmn/xpmem . It is a kernel module +
userspace library that can be used by vader as a single-copy mechanism.

In benchmarks it performs better than CMA but it may or may not perform
better with a real application.

See:

http://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy


ok, I will look (and relay the information to colleagues).

Thanks,

Eric



Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Nathan Hjelm

Great! I will add an MCA variable to force CMA and also enable it if 1)
no yama and 2) no PR_SET_PTRACER.

You might also look at using xpmem. You can find a version that supports
3.x @ https://github.com/hjelmn/xpmem . It is a kernel module +
userspace library that can be used by vader as a single-copy mechanism.

In benchmarks it performs better than CMA but it may or may not perform
better with a real application.

See:

http://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy

-Nathan
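
(The PR_SET_PTRACER prctl mentioned in 2) is a per-process opt-in: a rank
can declare which process, or any process, may ptrace it, and therefore
CMA-attach to it, even when yama is otherwise restrictive. A minimal
sketch, assuming Linux; the constants may be missing from older headers,
hence the fallback defines, and the helper name is made up:)

#include <sys/prctl.h>

#ifndef PR_SET_PTRACER
#define PR_SET_PTRACER 0x59616d61              /* "Yama" */
#endif
#ifndef PR_SET_PTRACER_ANY
#define PR_SET_PTRACER_ANY ((unsigned long) -1)
#endif

/* Hypothetical helper: allow any process to attach to (and hence copy
 * from) this one.  On kernels without yama the call just fails with
 * EINVAL, which is harmless to ignore here. */
static void allow_any_ptracer(void)
{
    (void) prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY, 0, 0, 0);
}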

On Thu, Feb 19, 2015 at 03:32:43PM -0500, Eric Chamberland wrote:
> On 02/19/2015 02:58 PM, Nathan Hjelm wrote:
> >On Thu, Feb 19, 2015 at 12:16:49PM -0500, Eric Chamberland wrote:
> >>
> >>On 02/19/2015 11:56 AM, Nathan Hjelm wrote:
> >>>
> >>>If you have yama installed you can try:
> >>
> >>Nope, I do not have it installed... is it absolutely necessary? (and would
> >>it change something when it fails when I am root?)
> >>
> >>Other question: In addition to "--with-cma" configure flag, do we have to
> >>pass any options to "mpicc" when compiling/linking an mpi application to use
> >>cma?
> >
> >No. CMA should work out of the box. You appear to have a setup I haven't
> >yet tested. It doesn't have yama nor does it have the PR_SET_PTRACER
> >prctl. It's quite possible there are no restrictions on ptrace in this
> >setup. Can you try changing the following line at
> >opal/mca/btl/vader/btl_vader_component.c:370 from:
> >
> >bool cma_happy = false;
> >
> >to
> >
> >bool cma_happy = true;
> >
> 
> ok! (as of the official release, this is line 386.)
> 
> >and let me know if that works. If it does I will update vader to allow
> >CMA in this configuration.
> 
> Yep!  It now works perfectly.  Testing with
> https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_bandwidth.c, on my
> own computer (dual Xeon), I have this:
> 
> Without CMA:
> 
> ***Message size:  100 *** best  /  avg  / worst (MB/sec)
>task pair:0 -1:8363.52 / 7946.77 / 5391.14
> 
> with CMA:
>task pair:0 -1:9137.92 / 8955.98 / 7489.83
> 
> Great!
> 
> Now I have to bench my real application... ;-)
> 
> Thanks!
> 
> Eric
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/02/26355.php




Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Eric Chamberland

On 02/19/2015 02:58 PM, Nathan Hjelm wrote:

On Thu, Feb 19, 2015 at 12:16:49PM -0500, Eric Chamberland wrote:


On 02/19/2015 11:56 AM, Nathan Hjelm wrote:


If you have yama installed you can try:


Nope, I do not have it installed... is it absolutely necessary? (and would
it change something when it fails when I am root?)

Other question: In addition to "--with-cma" configure flag, do we have to
pass any options to "mpicc" when compiling/linking an mpi application to use
cma?


No. CMA should work out of the box. You appear to have a setup I haven't
yet tested. It doesn't have yama nor does it have the PR_SET_PTRACER
prctl. It's quite possible there are no restrictions on ptrace in this
setup. Can you try changing the following line at
opal/mca/btl/vader/btl_vader_component.c:370 from:

bool cma_happy = false;

to

bool cma_happy = true;



ok! (as of the official release, this is line 386.)


and let me know if that works. If it does I will update vader to allow
CMA in this configuration.


Yep!  It now works perfectly.  Testing with 
https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_bandwidth.c, on 
my own computer (dual Xeon), I have this:


Without CMA:

***Message size:  100 *** best  /  avg  / worst (MB/sec)
   task pair:0 -1:8363.52 / 7946.77 / 5391.14

with CMA:
   task pair:0 -1:9137.92 / 8955.98 / 7489.83

Great!

Now I have to bench my real application... ;-)

Thanks!

Eric



Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Nathan Hjelm
On Thu, Feb 19, 2015 at 12:16:49PM -0500, Eric Chamberland wrote:
> 
> On 02/19/2015 11:56 AM, Nathan Hjelm wrote:
> >
> >If you have yama installed you can try:
> 
> Nope, I do not have it installed... is it absolutely necessary? (and would
> it change something when it fails when I am root?)
> 
> Other question: In addition to "--with-cma" configure flag, do we have to
> pass any options to "mpicc" when compiling/linking an mpi application to use
> cma?

No. CMA should work out of the box. You appear to have a setup I haven't
yet tested. It doesn't have yama nor does it have the PR_SET_PTRACER
prctl. It's quite possible there are no restrictions on ptrace in this
setup. Can you try changing the following line at
opal/mca/btl/vader/btl_vader_component.c:370 from:

bool cma_happy = false;

to

bool cma_happy = true;

and let me know if that works. If it does I will update vader to allow
CMA in this configuration.

-Nathan
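
(For anyone wondering what "CMA" actually is: it is Linux cross-memory
attach, i.e. the process_vm_readv()/process_vm_writev() system calls,
which let one rank copy data straight out of a peer rank's address space
with a single, kernel-mediated copy. The sketch below is illustrative
only; the peer pid and remote address are hypothetical and would normally
be exchanged through shared memory. This is the call that gets denied
under restrictive ptrace settings, which is what the warning quoted in
this thread is about.)

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

/* Minimal CMA sketch: copy len bytes from remote_addr in process pid
 * into local_buf.  Returns the number of bytes read, or -1 on error
 * (e.g. EPERM when ptrace restrictions forbid the attach). */
static ssize_t cma_read(pid_t pid, void *remote_addr,
                        void *local_buf, size_t len)
{
    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}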




Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Eric Chamberland


On 02/19/2015 11:56 AM, Nathan Hjelm wrote:


If you have yama installed you can try:


Nope, I do not have it installed... is it absolutely necessary? (and 
would it change something when it fails when I am root?)


Other question: In addition to "--with-cma" configure flag, do we have 
to pass any options to "mpicc" when compiling/linking an mpi application 
to use cma?


Thanks,

Eric



echo 1 > /proc/sys/kernel/yama/ptrace_scope

as root.

-Nathan

On Thu, Feb 19, 2015 at 11:06:09AM -0500, Eric Chamberland wrote:

By the way,

I have tried two other things:

#1- I launched it as root:

mpiexec --mca mca_btl_vader_single_copy_mechanism cma --allow-run-as-root
-np 2 ./hw

#2- Found this 
(http://askubuntu.com/questions/146160/what-is-the-ptrace-scope-workaround-for-wine-programs-and-are-there-any-risks)
and tried this:

sudo setcap cap_sys_ptrace=eip /tmp/hw

On both RedHat 6.5 and OpenSuse 12.3 I still get the same error message!!!
:-/

Sorry, I am not a kernel expert...

What's wrong?

Thanks,

Eric

On 02/18/2015 04:48 PM, Éric Chamberland wrote:


On 2015-02-18 15:14, Nathan Hjelm wrote:

I recommend using vader for CMA. It has code to get around the ptrace
setting. Run with mca_btl_vader_single_copy_mechanism cma (should be the
default).

Ok, I tried it, but it gives exactly the same error message!

Eric


-Nathan

On Wed, Feb 18, 2015 at 02:56:01PM -0500, Eric Chamberland wrote:

Hi,

I have configured with "--with-cma" on 2 different OSes (RedHat 6.6 and
OpenSuse 12.3), but in both cases, I have the following error when
launching
a simple mpi_hello_world.c example:

/opt/openmpi-1.8.4_cma/bin/mpiexec --mca btl_sm_use_cma 1 -np 2 /tmp/hw
--

WARNING: Linux kernel CMA support was requested via the
btl_vader_single_copy_mechanism MCA variable, but CMA support is
not available due to restrictive ptrace settings.

The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.

   Local host: compile
--

Hello world from process 0 of 2
Hello world from process 1 of 2
[compile:23874] 1 more process has sent help message
help-btl-vader.txt /
cma-permission-denied
[compile:23874] Set MCA parameter "orte_base_help_aggregate" to 0 to
see all
help / error messages

After I googled the subject, it seems there is a kernel parameter to
modify,
but I can't find it for OpenSuse 12.3 or RedHat 6.6...

Here is the "config.log" issued from RedHat 6.6...

http://www.giref.ulaval.ca/~ericc/ompi_bug/config.184_cma.gz

Thanks,

Eric
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/02/26339.php


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/02/26342.php


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/02/26351.php




Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Nathan Hjelm

If you have yama installed you can try:

echo 1 > /proc/sys/kernel/yama/ptrace_scope

as root.

-Nathan

On Thu, Feb 19, 2015 at 11:06:09AM -0500, Eric Chamberland wrote:
> By the way,
> 
> I have tried two other things:
> 
> #1- I launched it as root:
> 
> mpiexec --mca mca_btl_vader_single_copy_mechanism cma --allow-run-as-root
> -np 2 ./hw
> 
> #2- Found this 
> (http://askubuntu.com/questions/146160/what-is-the-ptrace-scope-workaround-for-wine-programs-and-are-there-any-risks)
> and tried this:
> 
> sudo setcap cap_sys_ptrace=eip /tmp/hw
> 
> On both RedHat 6.5 and OpenSuse 12.3 I still get the same error message!!!
> :-/
> 
> Sorry, I am not a kernel expert...
> 
> What's wrong?
> 
> Thanks,
> 
> Eric
> 
> On 02/18/2015 04:48 PM, Éric Chamberland wrote:
> >
> >On 2015-02-18 15:14, Nathan Hjelm wrote:
> >>I recommend using vader for CMA. It has code to get around the ptrace
> >>setting. Run with mca_btl_vader_single_copy_mechanism cma (should be the
> >>default).
> >Ok, I tried it, but it gives exactly the same error message!
> >
> >Eric
> >
> >>-Nathan
> >>
> >>On Wed, Feb 18, 2015 at 02:56:01PM -0500, Eric Chamberland wrote:
> >>>Hi,
> >>>
> >>>I have configured with "--with-cma" on 2 different OSes (RedHat 6.6 and
> >>>OpenSuse 12.3), but in both cases, I have the following error when
> >>>launching
> >>>a simple mpi_hello_world.c example:
> >>>
> >>>/opt/openmpi-1.8.4_cma/bin/mpiexec --mca btl_sm_use_cma 1 -np 2 /tmp/hw
> >>>--
> >>>
> >>>WARNING: Linux kernel CMA support was requested via the
> >>>btl_vader_single_copy_mechanism MCA variable, but CMA support is
> >>>not available due to restrictive ptrace settings.
> >>>
> >>>The vader shared memory BTL will fall back on another single-copy
> >>>mechanism if one is available. This may result in lower performance.
> >>>
> >>>   Local host: compile
> >>>--
> >>>
> >>>Hello world from process 0 of 2
> >>>Hello world from process 1 of 2
> >>>[compile:23874] 1 more process has sent help message
> >>>help-btl-vader.txt /
> >>>cma-permission-denied
> >>>[compile:23874] Set MCA parameter "orte_base_help_aggregate" to 0 to
> >>>see all
> >>>help / error messages
> >>>
> >>>After I googled the subject, it seems there is a kernel parameter to
> >>>modify,
> >>>but I can't find it for OpenSuse 12.3 or RedHat 6.6...
> >>>
> >>>Here is the "config.log" issued from RedHat 6.6...
> >>>
> >>>http://www.giref.ulaval.ca/~ericc/ompi_bug/config.184_cma.gz
> >>>
> >>>Thanks,
> >>>
> >>>Eric
> >>>___
> >>>users mailing list
> >>>us...@open-mpi.org
> >>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>Link to this post:
> >>>http://www.open-mpi.org/community/lists/users/2015/02/26339.php
> >
> >___
> >users mailing list
> >us...@open-mpi.org
> >Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >Link to this post:
> >http://www.open-mpi.org/community/lists/users/2015/02/26342.php
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/02/26351.php



