Re: [OMPI users] High errorcode message

2021-01-29 Thread Jeff Squyres (jsquyres) via users
It's somewhat hard to say without more information.

What is your app doing when it calls abort?
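
For context: the number in that banner is simply the second argument the
application passed to MPI_Abort(), so a large, odd-looking errorcode usually
originates in the application itself (e.g., an uninitialized variable or a
computed value) rather than in Open MPI. A minimal sketch of how the banner
gets its value (hypothetical code, not your actual app):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int err = 1734831948;   /* whatever value the app happens to hold here */

    /* mpirun reports exactly this integer in the
     * "MPI_ABORT was invoked ... with errorcode ..." message */
    MPI_Abort(MPI_COMM_WORLD, err);

    return 0;               /* never reached */
}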


On Jan 29, 2021, at 8:49 PM, Arturo Fernandez via users <users@lists.open-mpi.org> wrote:

Hello,
My system is running CentOS8 & OpenMPI v4.1.0. Most stuff is working fine but 
one app is aborting with:
MPI_ABORT was invoked on rank 7 in communicator MPI_COMM_WORLD
with errorcode 1734831948.
The other 23 MPI ranks also abort. I'm a bit confused by the high error code. 
Does it mean anything specific or should I focus on something else?
Thanks.




--
Jeff Squyres
jsquy...@cisco.com



[OMPI users] High errorcode message

2021-01-29 Thread Arturo Fernandez via users

Hello,
My system is running CentOS8 & OpenMPI v4.1.0. Most stuff is working fine
but one app is aborting with:
MPI_ABORT was invoked on rank 7 in communicator MPI_COMM_WORLD
with errorcode 1734831948.
The other 23 MPI ranks also abort. I'm a bit confused by the high error
code. Does it mean anything specific or should I focus on something else?
Thanks.


Re: [OMPI users] Binding blocks of processes in round-robin fashion

2021-01-29 Thread Ralph Castain via users
Okay, I can't promise when I'll get to it, but I'll try to have it in time for 
OMPI v5.


On Jan 29, 2021, at 1:30 AM, Luis Cebamanos via users <users@lists.open-mpi.org> wrote:

Hi Ralph,

It would be great to have it for load balancing issues. Ideally one could do
something like --bind-to:N where N is the block size, 4 in this case.

mpirun -np 40  --map-by ppr:40:node  --bind-to core:4

I think it would be interesting to have it. Of course, I can always use srun,
but not all systems run Slurm.

> Of course, you could fake it out even today by breaking it into multiple
> app-contexts on the cmd line. Something like this (again, shortening it to
> just two nodes):
>
> mpirun --map-by node --rank-by slot --bind-to core --np 8 myapp : --np 8 myapp
> : --np 8 myapp : --np 8 myapp : --np 8 myapp

It is a valid option, tedious for a large number of nodes though.

Thanks!



Re: [OMPI users] Debugging a crash

2021-01-29 Thread Gilles Gouaillardet via users
Diego,

the mpirun command line starts 2 MPI tasks, but the error log mentions
rank 56, so unless there is a copy/paste error, this is highly
suspicious.

I invite you to check the filesystem usage on this node, and make sure
there is a similar amount of available space in /tmp and /dev/shm (or
another filesystem if you use a non-standard $TMPDIR).
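
For example, something along these lines on the suspect node should show at a
glance whether /tmp or /dev/shm is (nearly) full:

df -h /tmp /dev/shm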

Cheers,

Gilles

On Fri, Jan 29, 2021 at 10:50 PM Diego Zuccato via users
<users@lists.open-mpi.org> wrote:
>
> Hello all.
>
> I'm having a problem with a job: if it gets scheduled on a specific node
> of our cluster, it fails with:
> -8<--
> --
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --
> [str957-mtx-10:38099] *** Process received signal ***
> [str957-mtx-10:38099] Signal: Segmentation fault (11)
> [str957-mtx-10:38099] Signal code: Address not mapped (1)
> [str957-mtx-10:38099] Failing at address: 0x7f98cb266008
> [str957-mtx-10:38099] [ 0]
> /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f98ca553730]
> [str957-mtx-10:38099] [ 1]
> /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x2936)[0x7f98c8a99936]
> [str957-mtx-10:38099] [ 2]
> /lib/x86_64-linux-gnu/libmca_common_dstore.so.1(pmix_common_dstor_init+0x9d3)[0x7f98c8a82733]
> [str957-mtx-10:38099] [ 3]
> /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x25b4)[0x7f98c8a995b4]
> [str957-mtx-10:38099] [ 4]
> /lib/x86_64-linux-gnu/libpmix.so.2(pmix_gds_base_select+0x12e)[0x7f98c8bdc46e]
> [str957-mtx-10:38099] [ 5]
> /lib/x86_64-linux-gnu/libpmix.so.2(pmix_rte_init+0x8cd)[0x7f98c8b9488d]
> [str957-mtx-10:38099] [ 6]
> /lib/x86_64-linux-gnu/libpmix.so.2(PMIx_Init+0xdc)[0x7f98c8b50d7c]
> [str957-mtx-10:38099] [ 7]
> /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so(ext2x_client_init+0xc4)[0x7f98c8c3afe4]
> [str957-mtx-10:38099] [ 8]
> /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so(+0x2656)[0x7f98c946d656]
> [str957-mtx-10:38099] [ 9]
> /lib/x86_64-linux-gnu/libopen-rte.so.40(orte_init+0x29a)[0x7f98ca2c111a]
> [str957-mtx-10:38099] [10]
> /lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x252)[0x7f98cae1ce62]
> [str957-mtx-10:38099] [11]
> /lib/x86_64-linux-gnu/libmpi.so.40(MPI_Init+0x6e)[0x7f98cae4b17e]
> [str957-mtx-10:38099] [12] Arepo(+0x3940)[0x561b45905940]
> [str957-mtx-10:38099] [13]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7f98ca3a409b]
> [str957-mtx-10:38099] [14] Arepo(+0x3d3a)[0x561b45905d3a]
> [str957-mtx-10:38099] *** End of error message ***
> --
> mpiexec noticed that process rank 56 with PID 37999 on node
> str957-mtx-10 exited on signal 11 (Segmentation fault).
> --
> slurmstepd-str957-mtx-00: error: *** JOB 12129 ON str957-mtx-00
> CANCELLED AT 2021-01-28T14:11:33 ***
> -8<--
> [I cut out the other repetitions of the stack trace for brevity.]
>
> The command used to launch it is:
> mpirun --mca mpi_leave_pinned 0 --mca oob_tcp_listen_mode listen_thread
> -np 2 --map-by socket Arepo someargs
>
> The same job, when scheduled to run on another node, works w/o problems.
> As far as I could check, the nodes are configured the same (actually
> installed from the same series of scripts and following the same
> procedure: it was a set of 16 nodes and just one is giving trouble).
> I tried with simpler MPI codes and could not reproduce the error. Other
> users are using the same node w/o problems with different codes.
> Packages are the same on all nodes. I already double-checked that kernel
> module config is the same and memlock is unlimited.
> Any hint where to look?
>
> Tks.
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786


[OMPI users] Debugging a crash

2021-01-29 Thread Diego Zuccato via users
Hello all.

I'm having a problem with a job: if it gets scheduled on a specific node
of our cluster, it fails with:
-8<--
--
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--
[str957-mtx-10:38099] *** Process received signal ***
[str957-mtx-10:38099] Signal: Segmentation fault (11)
[str957-mtx-10:38099] Signal code: Address not mapped (1)
[str957-mtx-10:38099] Failing at address: 0x7f98cb266008
[str957-mtx-10:38099] [ 0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f98ca553730]
[str957-mtx-10:38099] [ 1]
/usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x2936)[0x7f98c8a99936]
[str957-mtx-10:38099] [ 2]
/lib/x86_64-linux-gnu/libmca_common_dstore.so.1(pmix_common_dstor_init+0x9d3)[0x7f98c8a82733]
[str957-mtx-10:38099] [ 3]
/usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x25b4)[0x7f98c8a995b4]
[str957-mtx-10:38099] [ 4]
/lib/x86_64-linux-gnu/libpmix.so.2(pmix_gds_base_select+0x12e)[0x7f98c8bdc46e]
[str957-mtx-10:38099] [ 5]
/lib/x86_64-linux-gnu/libpmix.so.2(pmix_rte_init+0x8cd)[0x7f98c8b9488d]
[str957-mtx-10:38099] [ 6]
/lib/x86_64-linux-gnu/libpmix.so.2(PMIx_Init+0xdc)[0x7f98c8b50d7c]
[str957-mtx-10:38099] [ 7]
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so(ext2x_client_init+0xc4)[0x7f98c8c3afe4]
[str957-mtx-10:38099] [ 8]
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so(+0x2656)[0x7f98c946d656]
[str957-mtx-10:38099] [ 9]
/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_init+0x29a)[0x7f98ca2c111a]
[str957-mtx-10:38099] [10]
/lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x252)[0x7f98cae1ce62]
[str957-mtx-10:38099] [11]
/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Init+0x6e)[0x7f98cae4b17e]
[str957-mtx-10:38099] [12] Arepo(+0x3940)[0x561b45905940]
[str957-mtx-10:38099] [13]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7f98ca3a409b]
[str957-mtx-10:38099] [14] Arepo(+0x3d3a)[0x561b45905d3a]
[str957-mtx-10:38099] *** End of error message ***
--
mpiexec noticed that process rank 56 with PID 37999 on node
str957-mtx-10 exited on signal 11 (Segmentation fault).
--
slurmstepd-str957-mtx-00: error: *** JOB 12129 ON str957-mtx-00
CANCELLED AT 2021-01-28T14:11:33 ***
-8<--
[I cut out the other repetitions of the stack trace for brevity.]

The command used to launch it is:
mpirun --mca mpi_leave_pinned 0 --mca oob_tcp_listen_mode listen_thread
-np 2 --map-by socket Arepo someargs

The same job, when scheduled to run on another node, works w/o problems.
As far as I could check, the nodes are configured the same (actually
installed from the same series of scripts and following the same
procedure: it was a set of 16 nodes and just one is giving trouble).
I tried with simpler MPI codes and could not reproduce the error. Other
users are using the same node w/o problems with different codes.
Packages are the same on all nodes. I already double-checked that kernel
module config is the same and memlock is unlimited.
Any hint where to look?

Tks.

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


Re: [OMPI users] Binding blocks of processes in round-robin fashion

2021-01-29 Thread Luis Cebamanos via users
Hi Ralph,

It would be great to have it for load balancing issues. Ideally one
could do something like --bind-to:N where N is the block size, 4 in this
case.

mpirun -np 40  --map-by ppr:40:node  --bind-to core:4 

I think it would be interesting to have it. Of course, I can always use
srun, but not all systems run Slurm.


> Of course, you could fake it out even today by breaking it into
> multiple app-contexts on the cmd line. Something like this (again,
> shortening it to just two nodes):
>
> mpirun --map-by node --rank-by slot --bind-to core --np 8 myapp : --np
> 8 myapp : --np 8 myapp : --np 8 myapp : --np 8 myapp
>
It is a valid option, tedious for a large number of nodes though.
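
For a larger number of nodes, the repeated app-contexts could be generated
with a small shell loop; a rough sketch (NODES and myapp are placeholders):

NODES=5
APPS=""
for i in $(seq 1 "$NODES"); do
    APPS="$APPS --np 8 myapp :"    # one app-context per node
done
mpirun --map-by node --rank-by slot --bind-to core ${APPS% :}

With NODES=5 this expands to exactly the five-app-context command quoted
above.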

Thanks!